Table 1: Four scenarios when reasoning under uncertainty.^1

Actions don't change state of the world | Actions change state of the world
----------------------------------------+-----------------------------------
Multi-armed bandits                     | Reinforcement learning
Decision theory                         | Markov decision process


1. Introduction 

In a decision-making process, agents make decisions based on observations of the world.
Table 1 describes four scenarios when making decisions under uncertainty. In a multi-armed
bandits problem, the model of outcomes is unknown, and the outcomes can be stochastic
or adversarial; moreover, the actions taken do not change the state of the world.

In this survey we focus on multi-armed bandits. In this problem the agent needs to
make a sequence of decisions at times 1, 2, ..., T. At each time t the agent is given a set
of K arms, and it has to decide which arm to pull. After pulling an arm, it receives the
reward of that arm, and the rewards of the other arms remain unknown. In a stochastic setting the
reward of an arm is sampled from some unknown distribution, and in an adversarial setting
the reward of an arm is chosen by an adversary and is not necessarily sampled from any
distribution. In particular, in this survey we are interested in the situation where we observe
side information at each time t. We call this side information the context. The arm that
has the highest expected reward may be different given different contexts. This variant of
multi-armed bandits is called contextual bandits.

Usually in a contextual bandits problem there is a set of policies, and each policy maps
a context to an arm. There can be an infinite number of policies, especially when reducing
bandits to classification problems. We define the regret of the agent as the gap between the
highest expected cumulative reward any policy can achieve and the cumulative reward the
agent actually gets. The goal of the agent is to minimize the regret. Contextual bandits can
naturally model many problems. For example, in a news personalization system, we can
treat each news article as an arm, and the features of both articles and users as contexts.
The agent then picks articles for each user to maximize click-through rate or dwell time.

There are many bandits algorithms, and it is always important to know what they are
competing with. For example, in K-armed bandits, the agent competes with the arm
that has the highest expected reward; in contextual bandits with expert advice, the
agent competes with the expert that has the highest expected reward; and when we
reduce contextual bandits to classification/regression problems, the agent competes
with the best policy in a pre-defined policy set.


1. Table from CMU Graduate AI course slides. http://www.cs.cmu.edu/~15780/lec/10-Prob-start-mdp.pdf 
As an overview, we summarize all the algorithms we will talk about in Table 2. In this
table, C is the number of distinct contexts, N is the number of policies, K is the number
of arms, and d is the dimension of contexts. Note that the second-to-last column shows whether the
algorithm requires knowledge of T. This does not necessarily mean that the algorithm
needs T in order to run, but that knowledge of T is required to achieve the stated
regret.
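Table 2: A comparison of all the contextual bandits algorithms we will talk about

Algorithm | Regret | With high probability | Can have infinite policies | Need to know T | Adversarial reward
----------+--------+-----------------------+----------------------------+----------------+-------------------
Reduce to MAB | $O(\sqrt{TCK \ln K})$ or $O(\sqrt{TN \ln N})$ | no | no | yes | yes
EXP4 | $O(\sqrt{TK \ln N})$ | no | no | yes | yes
EXP4.P | $O(\sqrt{TK \ln(N/\delta)})$ | yes | no | yes | yes
LinUCB | $O(d\sqrt{T \ln((1+T)/\delta)})$ | yes | yes | yes | no
SupLinUCB | $O(\sqrt{Td \ln^3(KT \ln T/\delta)})$ | yes | yes | yes | no
SupLinREL | $O(\sqrt{Td}\,(1 + \ln(2KT \ln T/\delta))^{3/2})$ | yes | yes | yes | no
GP-UCB | $O(\sqrt{T}(B\sqrt{\gamma_T} + \gamma_T))$ | yes | yes | yes | no
KernelUCB | $O(\sqrt{B\tilde{d}T})$ | yes | yes | yes | no
Epoch-Greedy | $O((K \ln(N/\delta))^{1/3} T^{2/3})$ | yes | yes | no | no
Randomized UCB | $O(\sqrt{TK \ln(N/\delta)})$ | yes | yes | no | no
ILOVETOCONBANDITS | $O(\sqrt{TK \ln(N/\delta)})$ | yes | yes | no | no
Thompson Sampling with Linear Regression | $O(\frac{d^2}{\epsilon}\sqrt{T^{1+\epsilon}}\,(\ln(Td) \ln\frac{1}{\delta}))$ | yes | yes | no | no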


2. Unbiased Reward Estimator 

One challenge of bandits problems is that we only observe partial feedback. Suppose at
time t the algorithm randomly selects an arm $a_t$ based on a probability vector $p_t$. Denote
the true reward vector by $r_t \in [0,1]^K$ and the reward vector we observed by $r'_t \in [0,1]^K$;
then all the elements in $r'_t$ are zero except $r'_{t,a_t}$, which is equal to $r_{t,a_t}$. Then $r'_t$ is certainly
not an unbiased estimator of $r_t$ because $E(r'_{t,a_t}) = p_{a_t} \cdot r_{t,a_t} \neq r_{t,a_t}$. A common trick is
to use $\hat{r}_{t,a_t} = r'_{t,a_t} / p_{a_t}$ instead of $r'_{t,a_t}$. In this way we get an unbiased estimator of the true
reward vector $r_t$: for any arm a,

$$E(\hat{r}_{t,a}) = p_a \cdot r_{t,a}/p_a + (1 - p_a) \cdot 0 = r_{t,a}$$

The expectation is with respect to the random choice of arms at time t. This trick is used 
by many algorithms described later. 
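
The unbiasedness argument above can be checked numerically. The following is a minimal sketch, assuming a fixed (hypothetical) reward vector `true_rewards` and sampling distribution `probs`; neither comes from the survey. It compares the naive partial-feedback average with the importance-weighted one.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.5, 0.9])   # hypothetical r_t, held fixed for the check
probs = np.array([0.5, 0.3, 0.2])          # hypothetical sampling distribution p_t
n = 200_000

# Sample the pulled arm n times; only the pulled arm's reward is observed.
arms = rng.choice(len(probs), size=n, p=probs)

# Naive estimator r'_t: observed reward on the pulled arm, zero elsewhere.
naive = np.zeros((n, len(probs)))
naive[np.arange(n), arms] = true_rewards[arms]

# Importance-weighted estimator: divide the observed reward by p_{a_t}.
weighted = np.zeros((n, len(probs)))
weighted[np.arange(n), arms] = true_rewards[arms] / probs[arms]

naive_mean = naive.mean(axis=0)     # concentrates around p_a * r_a (biased)
mean_est = weighted.mean(axis=0)    # concentrates around r_a (unbiased)
```

The price of unbiasedness is variance: arms with small $p_a$ produce rare but large estimates, which is why algorithms that use this trick usually keep every $p_a$ bounded away from zero.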


3. Reduce to K-Armed Bandits 


If it is possible to enumerate all the contexts, then one naive way is to apply a K-armed
bandits algorithm to each context. However, in this way we ignore all the relationships
between contexts since we treat them independently.

Suppose there are C distinct contexts in the context set $\mathcal{X}$, and the context at time t
is $x_t \in \{1, 2, ..., C\}$. Also assume there are K arms in the arm set $\mathcal{A}$ and the arm selected
at time t is $a_t \in \{1, 2, ..., K\}$. Define the policy set to be all the possible mappings from
contexts to arms as $\Pi = \{f : \mathcal{X} \to \mathcal{A}\}$; then the regret of the agent is defined as:

$$R_T = \sup_{f \in \Pi} E\left[\sum_{t=1}^{T} (r_{t,f(x_t)} - r_{t,a_t})\right] \qquad (1)$$


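Theorem 3.1 Apply EXP3 (Auer et al., 2002b), a non-contextual multi-armed bandits algorithm,
on each context, then the regret is

$$R_T \le 2.63\sqrt{TCK \ln K}$$

Proof Define $n_i = \sum_{t=1}^{T} I(x_t = i)$, then $\sum_{i=1}^{C} n_i = T$. We know that the regret bound of the
EXP3 algorithm is $2.63\sqrt{TK \ln K}$, so

$$R_T = \sup_{f \in \Pi} E\left[\sum_{t=1}^{T} (r_{t,f(x_t)} - r_{t,a_t})\right]$$
$$= \sum_{i=1}^{C} \sup_{f \in \Pi} E\left[\sum_{t=1}^{T} I(x_t = i)(r_{t,f(x_t)} - r_{t,a_t})\right]$$
$$\le \sum_{i=1}^{C} 2.63\sqrt{n_i K \ln K}$$
$$\le 2.63\sqrt{TCK \ln K} \quad \text{(Cauchy-Schwarz inequality)}$$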
One problem with this method is that it assumes the contexts can be enumerated, which
is not true when contexts are continuous. Also, this algorithm treats each context independently,
so learning one of them does not help learning the other ones.

If there exists a set of pre-defined policies and we want to compete with the best
one, then another way to reduce to K-armed bandits is to treat each policy as an arm and
then apply the EXP3 algorithm. The regret is still defined as Equation (1), but $\Pi$ is now a
pre-defined policy set instead of all possible mappings from contexts to arms. Let N be
the number of policies in the policy set; then by applying the EXP3 algorithm we get the regret
bound $O(\sqrt{TN \ln N})$. This algorithm works if we have a small number of policies and a large
number of arms; however, if we have a huge number of policies, then this regret bound is
weak.
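
The first reduction (one EXP3 learner per context) can be sketched as follows. This is a minimal illustration under stated assumptions: the `EXP3` class is a textbook-style implementation with a fixed exploration rate `gamma`, and the Bernoulli environment given by `mean_rewards` is a hypothetical stochastic stand-in used only to exercise the code. It is not the tuned algorithm behind the constant 2.63 in Theorem 3.1.

```python
import numpy as np

class EXP3:
    """Exponential-weight learner for K arms with uniform exploration rate gamma."""
    def __init__(self, n_arms, gamma, rng):
        self.K, self.gamma, self.rng = n_arms, gamma, rng
        self.weights = np.ones(n_arms)

    def probabilities(self):
        w = self.weights / self.weights.sum()
        return (1 - self.gamma) * w + self.gamma / self.K

    def select(self):
        p = self.probabilities()
        arm = int(self.rng.choice(self.K, p=p))
        return arm, p[arm]

    def update(self, arm, prob, reward):
        est = reward / prob                       # unbiased reward estimate (Section 2)
        self.weights[arm] *= np.exp(self.gamma * est / self.K)
        self.weights /= self.weights.max()        # rescale for numerical stability

def run_per_context_exp3(mean_rewards, T, gamma=0.1, seed=0):
    """mean_rewards: (C, K) Bernoulli means of a hypothetical simulated environment."""
    rng = np.random.default_rng(seed)
    C, K = mean_rewards.shape
    learners = [EXP3(K, gamma, rng) for _ in range(C)]   # one independent learner per context
    total = 0.0
    for _ in range(T):
        x = int(rng.integers(C))                  # context revealed by the environment
        arm, p = learners[x].select()
        r = float(rng.random() < mean_rewards[x, arm])
        learners[x].update(arm, p, r)
        total += r
    return total / T, learners

means = np.array([[0.9, 0.1], [0.1, 0.9]])
avg_reward, learners = run_per_context_exp3(means, T=5000)
```

Because each learner only sees the rounds of its own context, nothing learned under one context transfers to another, which is exactly the weakness noted above.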


4. Stochastic Contextual Bandits 

Stochastic contextual bandits algorithms assume that the reward of each arm follows an
unknown probability distribution. Some algorithms further assume such a distribution is
sub-Gaussian with unknown parameters. In this section, we first talk about stochastic contextual
bandits algorithms with the linear realizability assumption; in this case, the expectation
of the reward of each arm is linear with respect to the arm's features. Then we talk about
algorithms that work for an arbitrary set of policies without such an assumption.


4.1 Stochastic Contextual Bandits with Linear Realizability Assumption

4.1.1 LinUCB/SupLinUCB 

LinUCB (Li et al., 2010; Chu et al., 2011) extends the UCB algorithm to contextual cases.
Suppose each arm is associated with a feature vector $x_{t,a} \in R^d$. In news recommendation,
$x_{t,a}$ could be user-article pairwise feature vectors. LinUCB assumes the expected reward of
an arm a is linear with respect to its feature vector $x_{t,a} \in R^d$:

$$E[r_{t,a} | x_{t,a}] = x_{t,a}^\top \theta^*$$

where $\theta^*$ is the true coefficient vector. The noise $\epsilon_{t,a}$ is assumed to be R-sub-Gaussian
for any t. Without loss of generality, we assume $\|\theta^*\| \le S$ and $\|x_{t,a}\| \le L$, where $\|\cdot\|$
denotes the $\ell_2$-norm. We also assume the reward $r_{t,a} \le 1$. Denote the best arm at time t
by $a_t^* = \arg\max_a x_{t,a}^\top \theta^*$ and the arm selected by the algorithm at time t by $a_t$; then the
T-trial regret of LinUCB is defined as

$$R_T = E\left[\sum_{t=1}^{T} r_{t,a_t^*} - \sum_{t=1}^{T} r_{t,a_t}\right] = \sum_{t=1}^{T} x_{t,a_t^*}^\top \theta^* - \sum_{t=1}^{T} x_{t,a_t}^\top \theta^*$$


Let $D_t \in R^{t \times d}$ and $c_t \in R^t$ be the historical data up to time t, where the i-th row of $D_t$
represents the feature vector of the arm pulled at time i, and the i-th row of $c_t$ represents the
corresponding reward. If the samples $(x_{t,a_t}, r_{t,a_t})$ are independent, then we can get a closed-form
estimator of $\theta^*$ by ridge regression:

$$\hat{\theta}_t = (D_t^\top D_t + \lambda I_d)^{-1} D_t^\top c_t$$


The accuracy of the estimator, of course, depends on the amount of data. Chu et al. (2011)
derived an upper confidence bound for the prediction:

Theorem 4.1 Suppose the rewards $r_{t,a}$ are independent random variables with means $E[r_{t,a}] =
x_{t,a}^\top \theta^*$, let $\epsilon = \sqrt{\frac{1}{2}\ln\frac{2TK}{\delta}}$ and $A_t = D_t^\top D_t + I_d$; then with probability $1 - \delta/T$, we have

$$|x_{t,a}^\top \hat{\theta}_t - x_{t,a}^\top \theta^*| \le (\epsilon + 1)\sqrt{x_{t,a}^\top A_t^{-1} x_{t,a}}$$

LinUCB always selects the arm with the highest upper confidence bound. The algorithm is
described in Algorithm 1.


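Algorithm 1 LinUCB
Require: $\alpha > 0$, $\lambda > 0$
  $A = \lambda I_d$
  $b = 0_d$
  for t = 1, 2, ..., T do
    $\hat{\theta}_t = A^{-1} b$
    Observe features of all K arms $a \in \mathcal{A}_t$: $x_{t,a} \in R^d$
    for a = 1, 2, ..., K do
      $s_{t,a} = x_{t,a}^\top \hat{\theta}_t + \alpha\sqrt{x_{t,a}^\top A^{-1} x_{t,a}}$
    end for
    Choose arm $a_t = \arg\max_a s_{t,a}$, break ties arbitrarily
    Receive reward $r_t \in [0,1]$
    $A = A + x_{t,a_t} x_{t,a_t}^\top$
    $b = b + x_{t,a_t} r_t$
  end for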
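
Algorithm 1 translates almost line by line into NumPy. The sketch below is illustrative only: `features_fn` and `reward_fn` are hypothetical callbacks standing in for the environment, and the toy problem at the end (two arms with one-hot features and a made-up `theta_star`) is an assumption used to exercise the code.

```python
import numpy as np

def linucb(features_fn, reward_fn, T, d, alpha=1.0, lam=1.0):
    """features_fn(t) -> (K, d) array; reward_fn(t, arm, x) -> float (both hypothetical)."""
    A = lam * np.eye(d)                      # A = lambda * I_d
    b = np.zeros(d)                          # b = 0_d
    theta = np.zeros(d)
    chosen = []
    for t in range(T):
        X = features_fn(t)                   # features of all K arms
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                    # ridge-regression estimate
        # Confidence width sqrt(x^T A^{-1} x) for every arm at once.
        width = np.sqrt(np.einsum("kd,de,ke->k", X, A_inv, X))
        scores = X @ theta + alpha * width   # upper confidence bounds s_{t,a}
        a = int(np.argmax(scores))           # ties resolved by lowest index
        r = reward_fn(t, a, X[a])
        A += np.outer(X[a], X[a])
        b += r * X[a]
        chosen.append(a)
    return theta, chosen

# Toy usage: two arms with one-hot features, so theta_star holds the arm means.
theta_star = np.array([0.1, 0.8])
noise_rng = np.random.default_rng(0)
theta, chosen = linucb(
    lambda t: np.eye(2),
    lambda t, a, x: float(x @ theta_star + 0.01 * noise_rng.normal()),
    T=500, d=2,
)
```

Inverting `A` every round costs O(d^3); a rank-one (Sherman-Morrison) update of the inverse is a common alternative when d is large.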


However, the LinUCB algorithm uses samples from previous rounds to estimate $\theta^*$ and then
picks a sample for the current round, so the samples are not independent. In Abbasi-Yadkori et al.
(2011) it was shown through martingale techniques that concentration results for the predictors
can be obtained directly without requiring the assumption that they are built as
linear combinations of independent random variables.

Theorem 4.2 (Abbasi-Yadkori et al. (2011)) Let the noise term $\epsilon_{t,a}$ be R-sub-Gaussian
where $R \ge 0$ is a fixed constant. With probability at least $1 - \delta$, $\forall t \ge 1$,

$$\|\hat{\theta}_t - \theta^*\|_{A_t} \le R\sqrt{2\log(|A_t|^{1/2} \lambda^{-1/2} \delta^{-1})} + \lambda^{1/2} S$$

We can now choose appropriate values of $\alpha_t$ for LinUCB as the right side of the inequality
in Theorem 4.2. Note that here $\alpha$ depends on t, so we denote it by $\alpha_t$; it is therefore a little
different from the original LinUCB algorithm (Algorithm 1), which relies on the independence assumption.
Theorem 4.3 Let $\lambda \ge \max(1, L^2)$. The cumulative regret of LinUCB is with probability at
least $1 - \delta$ bounded as:

$$R_T \le 4\sqrt{Td\log(1 + TL^2/(d\lambda))} \times \left(R\sqrt{d\log(1 + TL^2/(\lambda d)) + 2\log(1/\delta)} + \lambda^{1/2} S\right)$$

To prove Theorem 4.3, we first state two technical lemmas from Abbasi-Yadkori et al.
(2011):


Lemma 4.4 (Abbasi-Yadkori et al. (2011)) We have the following bound:

$$\sum_{t=1}^{T} \|x_t\|^2_{A_t^{-1}} \le 2\log\frac{|A_T|}{\lambda}$$

Lemma 4.5 (Abbasi-Yadkori et al. (2011)) The determinant $|A_t|$ can be bounded as:

$$|A_t| \le (\lambda + tL^2/d)^d$$

We can now simplify $\alpha_t$ as

$$\alpha_t \le R\sqrt{2\log(|A_t|^{1/2} \lambda^{-1/2} \delta^{-1})} + \lambda^{1/2} S$$
$$\le R\sqrt{d\log(1 + TL^2/(\lambda d)) + 2\log(1/\delta)} + \lambda^{1/2} S$$

where we use $d \ge 1$ and $\lambda \ge \max(1, L^2)$.

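Proof [Theorem 4.3] Let $\bar{r}_t$ denote the instantaneous regret at time t, let $x_{t,*}$ denote the
feature vector of the best arm at time t, and let $x_t$ denote the feature vector of the arm selected
at time t. With probability at least $1 - \delta$, for all t:

$$\bar{r}_t = x_{t,*}^\top \theta^* - x_t^\top \theta^*$$
$$\le x_t^\top \hat{\theta}_t + \alpha_t \|x_t\|_{A_t^{-1}} - x_t^\top \theta^* \qquad (2)$$
$$\le x_t^\top \hat{\theta}_t + \alpha_t \|x_t\|_{A_t^{-1}} - x_t^\top \hat{\theta}_t + \alpha_t \|x_t\|_{A_t^{-1}} \qquad (3)$$
$$= 2\alpha_t \|x_t\|_{A_t^{-1}}$$

The inequality (2) is by the algorithm design and reflects the optimistic principle of
LinUCB. Specifically, $x_{t,*}^\top \hat{\theta}_t + \alpha_t \|x_{t,*}\|_{A_t^{-1}} \le x_t^\top \hat{\theta}_t + \alpha_t \|x_t\|_{A_t^{-1}}$, from which:

$$x_{t,*}^\top \theta^* \le x_{t,*}^\top \hat{\theta}_t + \alpha_t \|x_{t,*}\|_{A_t^{-1}} \le x_t^\top \hat{\theta}_t + \alpha_t \|x_t\|_{A_t^{-1}}$$

In (3), we applied Theorem 4.2 to get:

$$x_t^\top \hat{\theta}_t \le x_t^\top \theta^* + \alpha_t \|x_t\|_{A_t^{-1}}$$

Finally, by Lemmas 4.4 and 4.5:

$$R_T = \sum_{t=1}^{T} \bar{r}_t \le \sqrt{T\sum_{t=1}^{T} \bar{r}_t^2}$$
$$\le 2\alpha_T\sqrt{T\sum_{t=1}^{T} \|x_t\|^2_{A_t^{-1}}}$$
$$\le 2\alpha_T\sqrt{T\log\frac{|A_T|}{\lambda}}$$
$$\le 2\alpha_T\sqrt{T(d\log(\lambda + TL^2/d) - \log\lambda)}$$
$$\le 2\alpha_T\sqrt{Td\log(1 + TL^2/(d\lambda))}$$

Above we used that $\alpha_t \le \alpha_T$ because $\alpha_t$ is non-decreasing in t. Next we used that $\lambda \ge
\max(1, L^2)$ to simplify the logarithm. By plugging in $\alpha_T$, we get:

$$R_T \le 4\sqrt{Td\log(1 + TL^2/(d\lambda))} \times \left(R\sqrt{d\log(1 + TL^2/(\lambda d)) + 2\log(1/\delta)} + \lambda^{1/2} S\right)$$
$$= O(d\sqrt{T\log((1 + T)/\delta)})$$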


Inspired by Auer (2003), Chu et al. (2011) proposed the SupLinUCB algorithm, which is a
variant of LinUCB. It is mainly used for theoretical analysis and is not a practical algorithm.
SupLinUCB constructs S sets to store previously pulled arms and rewards. The algorithm
is designed so that within the same set the sequence of feature vectors is fixed and the
rewards are independent. As a result, an arm's predicted reward in the current round is
a linear combination of rewards that are independent random variables, and so Azuma's
inequality can be used to get the regret bound of the algorithm.

Chu et al. (2011) proved that with probability at least $1 - \delta$, the regret bound of
SupLinUCB is $O(\sqrt{Td\ln^3(KT\ln(T)/\delta)})$.

4.1.2 LinREL/SupLinREL 


The problem setting of LinREL (Auer, 2003) is the same as LinUCB, so we use the same
notation here. LinREL and LinUCB both assume that for each arm there is an associated
feature vector $x_{t,a}$ and the expected reward of arm a is linear with respect to its feature
vector: $E[r_{t,a} | x_{t,a}] = x_{t,a}^\top \theta^*$, where $\theta^*$ is the true coefficient vector. However, these two
algorithms take two different forms of regularization. LinUCB takes an $\ell_2$ regularization
term similar to ridge regression; that is, it adds a diagonal matrix $\lambda I_d$ to the matrix $D_t^\top D_t$.
LinREL, on the other hand, regularizes by setting the small eigenvalues of the matrix $D_t^\top D_t$
to zero. The LinREL algorithm is described in Algorithm 2. We have the following theorem
to show that Equation (5) is the upper confidence bound of the true reward of arm a at
time t. Note that the following theorem assumes the rewards observed at each time t are
independent random variables. However, similar to LinUCB, this assumption is not true.
We will deal with this problem later.


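Algorithm 2 LinREL
Require: $\delta \in [0,1]$, number of trials T.
  Let $D_t \in R^{t \times d}$ and $c_t \in R^t$ be the matrix and vector to store previously pulled arm
  feature vectors and rewards.
  for t = 1, 2, ..., T do
    Calculate eigendecomposition
      $D_t^\top D_t = U_t^\top \mathrm{diag}(\lambda_t^1, \lambda_t^2, ..., \lambda_t^d) U_t$
    where $\lambda_t^1, ..., \lambda_t^k \ge 1$, $\lambda_t^{k+1}, ..., \lambda_t^d < 1$, and $U_t^\top U_t = I_d$
    Observe features of all K arms $a \in \mathcal{A}_t$: $x_{t,a} \in R^d$
    for a = 1, 2, ..., K do
      $\tilde{x}_{t,a} = (\tilde{x}_{t,a}^1, ..., \tilde{x}_{t,a}^d) = U_t x_{t,a}$
      $\tilde{u}_{t,a} = (\tilde{x}_{t,a}^1, ..., \tilde{x}_{t,a}^k, 0, ..., 0)^\top$
      $\tilde{v}_{t,a} = (0, ..., 0, \tilde{x}_{t,a}^{k+1}, ..., \tilde{x}_{t,a}^d)^\top$
      $w_{t,a} = \left(\tilde{u}_{t,a}^\top \cdot \mathrm{diag}\left(\frac{1}{\lambda_t^1}, ..., \frac{1}{\lambda_t^k}, 0, ..., 0\right) \cdot U_t \cdot D_t^\top\right)^\top$    (4)
      $s_{t,a} = w_{t,a}^\top c_t + \|w_{t,a}\| \left(\sqrt{\ln(2TK/\delta)}\right) + \|\tilde{v}_{t,a}\|$    (5)
    end for
    Choose arm $a_t = \arg\max_a s_{t,a}$, break ties arbitrarily
    Receive reward $r_t \in [0,1]$, append $x_{t,a_t}$ and $r_t$ to $D_t$ and $c_t$.
  end for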


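Theorem 4.6 Suppose the rewards $r_{\tau,a}$, $\tau \in \{1, ..., t-1\}$ are independent random variables
with mean $E[r_{\tau,a}] = x_{\tau,a}^\top \theta^*$. Then at time t, with probability $1 - \delta/T$ all arms $a \in \mathcal{A}_t$ satisfy

$$|w_{t,a}^\top c_t - x_{t,a}^\top \theta^*| \le \|w_{t,a}\| \left(\sqrt{2\ln(2TK/\delta)}\right) + \|\tilde{v}_{t,a}\|$$

Suppose $D_t^\top D_t$ is invertible; then we can estimate the model parameter $\hat{\theta} = (D_t^\top D_t)^{-1} D_t^\top c_t$.
Given a feature vector $x_{t,a}$, the predicted reward is

$$r_{t,a} = x_{t,a}^\top \hat{\theta} = (x_{t,a}^\top (D_t^\top D_t)^{-1} D_t^\top) c_t$$

So we can view $r_{t,a}$ as a linear combination of previous rewards. In Equation (4), $w_{t,a}$
is essentially the weights for each previous reward (after regularization). We use $w_{t,a}^\tau$ to
denote the weight of reward $r_{\tau,a}$.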
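Proof [Theorem 4.6] Let $z_\tau = r_{\tau,a} \cdot w_{t,a}^\tau$, then $|z_\tau| \le |w_{t,a}^\tau|$, and

$$\sum_{\tau=1}^{t-1} z_\tau = \sum_{\tau=1}^{t-1} r_{\tau,a} \cdot w_{t,a}^\tau = w_{t,a}^\top c_t$$

$$\sum_{\tau=1}^{t-1} E[z_\tau | z_1, ..., z_{\tau-1}] = \sum_{\tau=1}^{t-1} E[z_\tau] = \sum_{\tau=1}^{t-1} x_{\tau,a}^\top \theta^* \cdot w_{t,a}^\tau$$

Applying Azuma's inequality, we have

$$P\left(\sum_{\tau=1}^{t-1} z_\tau - \sum_{\tau=1}^{t-1} x_{\tau,a}^\top \theta^* \cdot w_{t,a}^\tau > \|w_{t,a}\| \left(\sqrt{2\ln(2TK/\delta)}\right)\right)$$
$$= P\left(w_{t,a}^\top c_t - w_{t,a}^\top D_t \theta^* > \|w_{t,a}\| \left(\sqrt{2\ln(2TK/\delta)}\right)\right) \le \frac{\delta}{TK}$$

Now what we really need is the inequality between $w_{t,a}^\top c_t$ and $x_{t,a}^\top \theta^*$. Note that

$$x_{t,a} = U_t^\top \tilde{x}_{t,a}$$
$$= U_t^\top \tilde{u}_{t,a} + U_t^\top \tilde{v}_{t,a}$$
$$= D_t^\top D_t (D_t^\top D_t)^{-1} D_t^\top w_{t,a} + U_t^\top \tilde{v}_{t,a}$$
$$= D_t^\top w_{t,a} + U_t^\top \tilde{v}_{t,a}$$

so that $x_{t,a}^\top \theta^* = w_{t,a}^\top D_t \theta^* + \tilde{v}_{t,a}^\top U_t \theta^*$. Assuming $\|\theta^*\| \le 1$, the last term is at most $\|\tilde{v}_{t,a}\|$ in absolute value, and we have

$$P\left(w_{t,a}^\top c_t - x_{t,a}^\top \theta^* > \|w_{t,a}\| \left(\sqrt{2\ln(2TK/\delta)}\right) + \|\tilde{v}_{t,a}\|\right) \le \frac{\delta}{TK}$$

Taking the union bound over all arms, we prove the theorem. ■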


The above proof uses the assumption that all the rewards observed are independent
random variables. However, in LinREL the actions taken in previous rounds influence
the estimated $\hat{\theta}$, and thus influence the decision in the current round. To deal with this problem,
Auer (2003) proposed the SupLinREL algorithm. SupLinREL constructs S sets, where each
set contains the arms pulled at stage s. The sets are designed so that the rewards of arms inside
one stage are independent, and within one stage the LinREL algorithm is applied. Auer (2003) proved
that the regret bound of SupLinREL is $O(\sqrt{Td}\,(1 + \ln(2KT\ln T))^{3/2})$.


4.1.3 CofineUCB 

4.1.4 Thompson Sampling with Linear Payoffs 

Thompson sampling is a heuristic to balance exploration and exploitation, and it achieves
good empirical results on display ads and news recommendation (Chapelle and Li, 2011).
Thompson sampling can be applied to both contextual and non-contextual multi-armed
bandits problems. For example, Agrawal and Goyal (2013b) provide an $O(\sqrt{NT\ln T})$ regret
bound for the non-contextual case. Here we focus on the contextual case.

Let $\mathcal{D}$ be the set of past observations $(x_t, a_t, r_t)$, where $x_t$ is the context, $a_t$ is the arm
pulled, and $r_t$ is the reward of that arm. Thompson sampling assumes a parametric likelihood
function $P(r | a, x, \theta)$ for the reward, where $\theta$ is the model parameter. We denote the
true parameters by $\theta^*$. Ideally, we would choose an arm that maximizes the expected reward
$\max_a E(r | a, x, \theta^*)$, but of course we don't know the true parameters. Instead, Thompson
sampling applies a prior belief $P(\theta)$ on the parameter $\theta$, and then, based on the data observed,
it updates the posterior distribution of $\theta$ by $P(\theta | \mathcal{D}) \propto P(\theta)\prod_{t=1}^{T} P(r_t | x_t, a_t, \theta)$. Now if we
just want to maximize the immediate reward, then we would choose an arm that maximizes
$E(r | a, x) = \int E(r | a, x, \theta) P(\theta | \mathcal{D}) d\theta$, but in an exploration/exploitation setting, we want
to choose an arm according to its probability of being optimal. So Thompson sampling
randomly selects an action a according to

$$\int I\left[E(r | a, \theta) = \max_{a'} E(r | a', \theta)\right] P(\theta | \mathcal{D}) d\theta$$

In the actual algorithm, we don't need to calculate the integral; it suffices to draw a random
parameter $\theta$ from the posterior distribution and then select the arm with the highest reward under
that $\theta$. The general framework of Thompson sampling is described in Algorithm 3.


Algorithm 3 General Framework of Thompson Sampling
  Define $\mathcal{D} = \{\}$
  for t = 1, ..., T do
    Receive context $x_t$
    Draw $\theta_t$ from posterior distribution $P(\theta | \mathcal{D})$
    Select arm $a_t = \arg\max_a E(r | x_t, a, \theta_t)$
    Receive reward $r_t$
    $\mathcal{D} = \mathcal{D} \cup \{(x_t, a_t, r_t)\}$
  end for


Depending on the prior we choose or the likelihood function we use, we can have different
variants of Thompson sampling. In the following section we introduce two of them.

Agrawal and Goyal (2013a) proposed a Thompson sampling algorithm with linear payoffs.
Suppose there are a total of K arms, and each arm a is associated with a d-dimensional
feature vector $x_{t,a}$ at time t. Note that in general $x_{t,a} \neq x_{t',a}$. There is no assumption on the
distribution of x, so the context can be chosen by an adversary. A linear predictor is defined
by a d-dimensional parameter $\mu \in R^d$, and predicts the mean reward of arm a by $\mu \cdot x_{t,a}$.
Agrawal and Goyal (2013a) assume an unknown underlying parameter $\mu^* \in R^d$ such that
the expected reward for arm a at time t is $\bar{r}_{t,a} = \mu^* \cdot x_{t,a}$. The real reward $r_{t,a}$ of arm a at
time t is generated from an unknown distribution with mean $\bar{r}_{t,a}$. At each time $t \in \{1, ..., T\}$
the algorithm chooses an arm $a_t$ and receives reward $r_t$. Let $a_t^*$ be the optimal arm at time
t:

$$a_t^* = \arg\max_a \bar{r}_{t,a}$$
a 
and $\Delta_{t,a}$ be the difference of the expected reward between the optimal arm and arm a:

$$\Delta_{t,a} = \bar{r}_{t,a_t^*} - \bar{r}_{t,a}$$

Then the regret of the algorithm is defined as:

$$R_T = \sum_{t=1}^{T} \Delta_{t,a_t}$$

In the paper they assume $\eta_{t,a} = r_{t,a} - \bar{r}_{t,a}$ is conditionally R-sub-Gaussian, which holds, for example,
whenever $r_{t,a} \in [\bar{r}_{t,a} - R, \bar{r}_{t,a} + R]$ for a constant $R \ge 0$. There are many likelihood distributions
that satisfy this R-sub-Gaussian condition. But to make the algorithm simple, they use a
Gaussian likelihood and a Gaussian prior. The likelihood of reward $r_{t,a}$ given the context $x_{t,a}$
is given by the pdf of the Gaussian distribution $\mathcal{N}(x_{t,a}^\top \mu, v^2)$. v is defined as $v = R\sqrt{\frac{24}{\epsilon} d \ln(\frac{1}{\delta})}$,
where $\epsilon \in (0,1)$ is the algorithm parameter and $\delta$ controls the high probability regret bound.
Similar to the closed form of linear regression, we define

$$B_t = I_d + \sum_{\tau=1}^{t-1} x_{\tau,a_\tau} x_{\tau,a_\tau}^\top$$

$$\hat{\mu}_t = B_t^{-1}\left(\sum_{\tau=1}^{t-1} x_{\tau,a_\tau} r_{\tau,a_\tau}\right)$$

Then we have the following theorem: 

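Theorem 4.7 If the prior of $\mu^*$ at time t is defined as $\mathcal{N}(\hat{\mu}_t, v^2 B_t^{-1})$, then the posterior of
$\mu^*$ is $\mathcal{N}(\hat{\mu}_{t+1}, v^2 B_{t+1}^{-1})$.

Proof

$$P(\mu | r_{t,a}) \propto P(r_{t,a} | \mu) P(\mu)$$
$$\propto \exp\left(-\frac{1}{2v^2}\left((r_{t,a} - \mu^\top x_{t,a})^2 + (\mu - \hat{\mu}_t)^\top B_t (\mu - \hat{\mu}_t)\right)\right)$$
$$\propto \exp\left(-\frac{1}{2v^2}\left(\mu^\top B_{t+1}\mu - 2\mu^\top B_{t+1}\hat{\mu}_{t+1}\right)\right)$$
$$\propto \exp\left(-\frac{1}{2v^2}(\mu - \hat{\mu}_{t+1})^\top B_{t+1}(\mu - \hat{\mu}_{t+1})\right)$$
$$\propto \mathcal{N}(\hat{\mu}_{t+1}, v^2 B_{t+1}^{-1})$$

Theorem 4.7 gives us a way to update our belief about the parameter after observing new
data. The algorithm is described in Algorithm 4.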


Theorem 4.8 With probability $1 - \delta$, the regret is bounded by:

$$R_T = O\left(\frac{d^2}{\epsilon}\sqrt{T^{1+\epsilon}}\left(\ln(Td)\ln\frac{1}{\delta}\right)\right)$$
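Algorithm 4 Thompson Sampling with Linear Payoff
Require: $\delta \in (0,1]$
  Define $v = R\sqrt{\frac{24}{\epsilon} d \ln(\frac{1}{\delta})}$, $B = I_d$, $\hat{\mu} = 0_d$, $f = 0_d$
  for t = 1, 2, ..., T do
    Sample $\tilde{\mu}_t$ from distribution $\mathcal{N}(\hat{\mu}, v^2 B^{-1})$
    Pull arm $a_t = \arg\max_a x_{t,a}^\top \tilde{\mu}_t$
    Receive reward $r_t$
    Update:
      $B = B + x_{t,a_t} x_{t,a_t}^\top$
      $f = f + x_{t,a_t} r_t$
      $\hat{\mu} = B^{-1} f$
  end for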
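
The update loop of Algorithm 4 can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: `features_fn` and `reward_fn` are hypothetical environment callbacks, and `v` is passed in as a free constant rather than computed from R, epsilon, and delta, because the theoretical value is usually far too conservative for a small demo.

```python
import numpy as np

def linear_thompson_sampling(features_fn, reward_fn, T, d, v=0.5, seed=0):
    """features_fn(t) -> (K, d) array; reward_fn(t, arm, x) -> float (both hypothetical)."""
    rng = np.random.default_rng(seed)
    B = np.eye(d)              # B = I_d
    f = np.zeros(d)            # f = 0_d
    mu_hat = np.zeros(d)       # posterior mean
    chosen = []
    for t in range(T):
        X = features_fn(t)
        # Draw a parameter from the Gaussian posterior N(mu_hat, v^2 B^{-1}).
        mu_tilde = rng.multivariate_normal(mu_hat, v**2 * np.linalg.inv(B))
        a = int(np.argmax(X @ mu_tilde))     # act greedily w.r.t. the sample
        r = reward_fn(t, a, X[a])
        B += np.outer(X[a], X[a])
        f += r * X[a]
        mu_hat = np.linalg.solve(B, f)       # mu_hat = B^{-1} f
        chosen.append(a)
    return mu_hat, chosen

# Toy usage: three arms with one-hot features, so mu_star holds the arm means.
mu_star = np.array([0.1, 0.5, 0.9])
mu_hat, chosen = linear_thompson_sampling(
    lambda t: np.eye(3),
    lambda t, a, x: float(x @ mu_star),
    T=2000, d=3,
)
```

Exploration here comes entirely from the posterior draw: arms pulled rarely keep a wide posterior, so they are occasionally sampled with an optimistic parameter and tried again.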


Chapelle and Li (2011) described a way of doing Thompson sampling with logistic regression.
Let w be the weight vector of logistic regression and $w_i$ be its i-th element. Each $w_i$
follows a Gaussian distribution $w_i \sim \mathcal{N}(m_i, q_i^{-1})$. They apply the Laplace approximation to get
the posterior distribution of the weight vector, which is a Gaussian distribution with diagonal
covariance matrix. The algorithm is described in Algorithm 5. Chapelle and Li (2011)
did not give a regret bound for this algorithm, but showed that it achieves good empirical
results on display advertising.


4.1.5 SpectralUCB 

4.2 Kernelized Stochastic Contextual Bandits 

Recall that in Section 4.1 we assume a linear relationship between the arm's features and the
expected reward: $E(r) = x^\top \theta^*$; however, the linearity assumption is not always true. Instead,
in this section we assume the expected reward of an arm is given by an unknown (possibly
non-linear) reward function $f : R^d \to R$:

$$r = f(x) + \epsilon \qquad (6)$$

where $\epsilon$ is a noise term with mean zero. We further assume that f is from a Reproducing
Kernel Hilbert Space (RKHS) corresponding to some kernel $k(\cdot, \cdot)$. We define $\phi : R^d \to \mathcal{H}$
as the mapping from the domain of x to the RKHS $\mathcal{H}$, so that $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$. In the
following we talk about GP-UCB/CGP-UCB and KernelUCB. GP-UCB/CGP-UCB is a
Bayesian approach that puts a Gaussian Process prior on f to encode the assumption of
smoothness, and KernelUCB is a Frequentist approach that builds estimators from linear
regression in the RKHS $\mathcal{H}$ and chooses an appropriate regularizer to encode the assumption of
smoothness.
Algorithm 5 Thompson Sampling with Logistic Regression
Require: $\lambda \ge 0$, batch size $S \ge 0$
  Define $\mathcal{D} = \{\}$, $m_i = 0$, $q_i = \lambda$ for all elements in the weight vector $w \in R^d$.
  for each batch b = 1, ..., B do    ▷ Process in mini-batch style
    Draw w from posterior distribution $\mathcal{N}(m, \mathrm{diag}(q)^{-1})$
    for t = 1, ..., S do
      Receive context $x_{b,t,j}$ for each article j.
      Select arm $a_t = \arg\max_j 1/(1 + \exp(-x_{b,t,j} \cdot w))$
      Receive reward $r_t \in \{0,1\}$
      $\mathcal{D} = \mathcal{D} \cup \{(x_{b,t,a_t}, a_t, r_t)\}$
    end for
    Solve the following optimization problem to get $\bar{w}$
      $\frac{1}{2}\sum_{i=1}^{d} q_i (w_i - m_i)^2 + \sum_{(x,r) \in \mathcal{D}} \ln(1 + \exp(-r\, w \cdot x))$
    Set prior for next block
      $m_i = \bar{w}_i$
      $q_i = q_i + \sum_{(x,r) \in \mathcal{D}} x_i^2 p_j (1 - p_j)$, $p_j = (1 + \exp(-\bar{w} \cdot x))^{-1}$
  end for

4.2.1 GP-UCB/CGP-UCB 

The Gaussian Process can be viewed as a prior over a regression function:

$$f(x) \sim GP(\mu(x), k(x, x'))$$

where $\mu(x)$ is the mean function and $k(x, x')$ is the covariance function:

$$\mu(x) = E(f(x))$$
$$k(x, x') = E\left((f(x) - \mu(x))(f(x') - \mu(x'))\right)$$

Assume the noise term $\epsilon$ in Equation (6) follows a Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with some
variance $\sigma^2$. Then, given any finite set of points $\{x_1, ..., x_N\}$, their response $r_N = [r_1, ..., r_N]^\top$
follows a multivariate Gaussian distribution:

$$r_N \sim \mathcal{N}([\mu(x_1), ..., \mu(x_N)]^\top, K_N + \sigma^2 I_N)$$

where $(K_N)_{ij} = k(x_i, x_j)$. It turns out that the posterior distribution of f given $\{x_1, ..., x_N\}$
is also a Gaussian Process distribution $GP(\mu_N(x), k_N(x, x'))$ with

$$\mu_N(x) = k_N(x)^\top (K_N + \sigma^2 I)^{-1} r_N$$
$$k_N(x, x') = k(x, x') - k_N(x)^\top (K_N + \sigma^2 I)^{-1} k_N(x')$$

where $k_N(x) = [k(x_1, x), ..., k(x_N, x)]^\top$.

GP-UCB (Srinivas et al., 2010) is a Bayesian approach to infer the unknown reward
function f. The domain of f is denoted by $\mathcal{D}$. $\mathcal{D}$ could be a finite set containing $|\mathcal{D}|$
d-dimensional vectors, or an infinite set such as $R^d$. GP-UCB puts a Gaussian process prior
on f: $f \sim GP(\mu(x), k(x, x'))$, and it updates the posterior distribution of f after each
observation. Inspired by the UCB-style algorithm (Auer et al., 2002a), it selects a point
$x_t$ at time t with the following strategy:

$$x_t = \arg\max_{x \in \mathcal{D}} \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x) \qquad (7)$$

where $\mu_{t-1}(x)$ is the posterior mean at x, $\sigma^2_{t-1}(x) = k_{t-1}(x, x)$, and $\beta_t$ is an appropriately
chosen constant. Equation (7) shows the exploration-exploitation tradeoff of GP-UCB: large $\mu_{t-1}(x)$
represents high estimated reward, and large $\sigma_{t-1}(x)$ represents high uncertainty. GP-UCB
is described in Algorithm 6.


Algorithm 6 GP-UCB
Require: $\mu_0 = 0$, $\sigma_0$, kernel k
  for t = 1, 2, ... do
    select arm $a_t = \arg\max_{a \in \mathcal{A}} \mu_{t-1}(x_{t,a}) + \sqrt{\beta_t}\,\sigma_{t-1}(x_{t,a})$
    receive reward $r_t$
    Update posterior distribution of f; obtain $\mu_t$ and $\sigma_t$
  end for
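
The posterior formulas of this section and the selection rule (7) can be sketched together. The code below is a minimal illustration assuming a zero prior mean and a squared exponential kernel; the helper names `rbf_kernel`, `gp_posterior`, and `gp_ucb_select`, the fixed `beta`, and the toy data are all hypothetical choices, not taken from Srinivas et al. (2010).

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared exponential kernel between the rows of A (n, d) and B (m, d)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X_obs, r_obs, X_query, noise_var=0.01, kernel=rbf_kernel):
    """Posterior mean mu_N(x) and variance k_N(x, x) at each query point."""
    K = kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))   # K_N + sigma^2 I
    k_q = kernel(X_obs, X_query)                                # columns are k_N(x)
    mean = k_q.T @ np.linalg.solve(K, r_obs)
    var = np.diag(kernel(X_query, X_query)) - np.sum(k_q * np.linalg.solve(K, k_q), axis=0)
    return mean, np.maximum(var, 0.0)                           # clip tiny negatives

def gp_ucb_select(X_obs, r_obs, X_cand, beta=4.0, **kw):
    """Index of the candidate maximizing mean + sqrt(beta) * std, as in rule (7)."""
    mean, var = gp_posterior(X_obs, r_obs, X_cand, **kw)
    return int(np.argmax(mean + np.sqrt(beta * var)))

# Toy usage: two observed points and one far-away, unexplored candidate.
X_obs = np.array([[0.0], [1.0]])
r_obs = np.array([0.0, 1.0])
X_cand = np.array([[0.0], [1.0], [5.0]])
mean, var = gp_posterior(X_obs, r_obs, X_cand)
pick = gp_ucb_select(X_obs, r_obs, X_cand)
```

In this toy run the far-away candidate wins because its posterior variance is still close to the prior variance, which illustrates the exploration term in (7).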
The regret of GP-UCB is defined as follows:

$$R_T = \sum_{t=1}^{T} f(x^*) - f(x_t) \qquad (8)$$

where $x^* = \arg\max_{x \in \mathcal{D}} f(x)$. From a bandits algorithm's perspective, we can view each
data point x in GP-UCB as an arm; however, in this case the features of an arm won't change
based on the contexts observed, and the best arm is always the same. We can also view
each data point x as a feature vector that encodes both the arm and context information;
however, in that case $x^*$ in Equation (8) becomes $x_t^* = \arg\max_{x \in \mathcal{D}_t} f(x)$, where $\mathcal{D}_t$ is the
domain of f under the current context.

Define $I(r_A; f) = H(r_A) - H(r_A | f)$ as the mutual information between f and the rewards
of a set of arms $A \subseteq \mathcal{D}$. Define the maximum information gain $\gamma_T$ after T rounds as

$$\gamma_T = \max_{A : |A| = T} I(r_A; f)$$

Note that $\gamma_T$ depends on the kernel we choose. Srinivas et al. (2010) showed that if
$\beta_t = 2\ln(K t^2 \pi^2/(6\delta))$, then GP-UCB achieves a regret bound of $O(\sqrt{T\gamma_T \ln K})$ with high
probability. Srinivas et al. (2010) also analyzed the agnostic setting, that is, the true function
f is not sampled from a Gaussian Process prior, but has bounded norm in the RKHS:

Theorem 4.9 Suppose the true f is in the RKHS $\mathcal{H}$ corresponding to kernel $k(x, x')$. Assume
$\langle f, f \rangle_{\mathcal{H}} \le B$. Let $\beta_t = 2B + 300\gamma_t \ln^3(t/\delta)$, let the prior be $GP(0, k(x, x'))$, and the
noise model be $\mathcal{N}(0, \sigma^2)$. Assume the true noise $\epsilon$ has zero mean and is bounded by $\sigma$ almost
surely. Then the regret bound of GP-UCB is

$$R_T = O\left(\sqrt{T}(B\sqrt{\gamma_T} + \gamma_T)\right)$$

with high probability.


Srinivas et al. (2010) also showed the bound of $\gamma_T$ for some common kernels. For the finite
dimensional linear kernel $\gamma_T = O(d\ln T)$; for the squared exponential kernel $\gamma_T = O((\ln T)^{d+1})$.

CGP-UCB (Krause and Ong, 2011) extends GP-UCB and explicitly models the contexts.
It defines a context space $\mathcal{Z}$ and an arm space $\mathcal{D}$; both $\mathcal{Z}$ and $\mathcal{D}$ can be infinite sets.
CGP-UCB assumes the unknown reward function f is defined over the joint space of contexts and
arms:

$$r = f(z, x) + \epsilon$$

where $z \in \mathcal{Z}$ and $x \in \mathcal{D}$. The algorithm framework is the same as GP-UCB except that
now we need to choose a kernel k over the joint space of $\mathcal{Z}$ and $\mathcal{D}$. Krause and Ong (2011)
proposed one possible kernel $k(\{z, x\}, \{z', x'\}) = k_{\mathcal{Z}}(z, z') k_{\mathcal{D}}(x, x')$. We can use different
kernels for the context space and the arm space.
4.2.2 KernelUCB

KernelUCB (Valko et al., 2013) is a Frequentist approach to learn the unknown reward
function f. It estimates f using regularized linear regression in the RKHS corresponding to
some kernel $k(\cdot, \cdot)$. We can also view KernelUCB as a kernelized version of LinUCB.

Assume there are K arms in the arm set $\mathcal{A}$, and the best arm at time t is $a_t^* =
\arg\max_{a \in \mathcal{A}} f(x_{t,a})$; then the regret is defined as

$$R_T = \sum_{t=1}^{T} f(x_{t,a_t^*}) - f(x_{t,a_t})$$

We apply kernelized ridge regression to estimate f. Given the arms pulled $\{x_1, ..., x_{t-1}\}$
and their rewards $r_t = [r_1, ..., r_{t-1}]^\top$ up to time t - 1, define the dual variable

$$\alpha_t = (K_t + \gamma I)^{-1} r_t$$

where $(K_t)_{ij} = k(x_i, x_j)$. Then the predictive value of a given arm $x_{t,a}$ has the following
closed form

$$\hat{f}(x_{t,a}) = k_t(x_{t,a})^\top \alpha_t$$

where $k_t(x_{t,a}) = [k(x_1, x_{t,a}), ..., k(x_{t-1}, x_{t,a})]^\top$. Now that we have the predicted reward, we need
to compute the half width of the confidence interval of the predicted reward. Recall that
in LinUCB such a half width is defined as $\sqrt{x_{t,a}^\top (D_t^\top D_t + \gamma I_d)^{-1} x_{t,a}}$; similarly, in kernelized
ridge regression we define the half width as

$$\hat{\sigma}_{t,a} = \sqrt{\phi(x_{t,a})^\top (\Phi_t^\top \Phi_t + \gamma I)^{-1} \phi(x_{t,a})} \qquad (9)$$

where $\phi(\cdot)$ is the mapping from the domain of x to the RKHS, and $\Phi_t = [\phi(x_1)^\top, ..., \phi(x_{t-1})^\top]^\top$.
In order to compute (9), Valko et al. (2013) derived a dual representation of (9):

$$\hat{\sigma}_{t,a} = \gamma^{-1/2}\sqrt{k(x_{t,a}, x_{t,a}) - k_t(x_{t,a})^\top (K_t + \gamma I)^{-1} k_t(x_{t,a})}$$

KernelUCB chooses the action $a_t$ at time t with the following strategy

$$a_t = \arg\max_{a \in \mathcal{A}} \left(k_t(x_{t,a})^\top \alpha_t + \eta\,\hat{\sigma}_{t,a}\right)$$

where $\eta$ is the scaling parameter.
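
The equivalence between the primal half width (9) and its dual form can be verified numerically when the feature map is explicit. The check below is a small sketch assuming the linear kernel, so that $\phi(x) = x$; the random matrix `Phi`, the vector `x`, and the value of `gamma` are arbitrary hypothetical inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
Phi = rng.normal(size=(6, 3))    # rows are phi(x_1), ..., phi(x_{t-1}) with phi(x) = x
x = rng.normal(size=3)           # feature vector of the arm being scored

# Primal form: phi(x)^T (Phi^T Phi + gamma I)^{-1} phi(x), a d x d linear system.
primal = x @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(3), x)

# Dual form: (k(x, x) - k_t(x)^T (K_t + gamma I)^{-1} k_t(x)) / gamma, using only kernel values.
K = Phi @ Phi.T                  # (K_t)_{ij} = k(x_i, x_j)
k = Phi @ x                      # k_t(x)
dual = (x @ x - k @ np.linalg.solve(K + gamma * np.eye(6), k)) / gamma

half_width = np.sqrt(dual)       # the confidence half width used by KernelUCB
```

The dual form touches the data only through kernel evaluations, which is what makes the half width computable when $\phi(x)$ is infinite-dimensional.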


To derive a regret bound, Valko et al. (2013) proposed SupKernelUCB based on KernelUCB,
which is similar to the relationship between SupLinUCB and LinUCB. Since the
dimension of $\phi(x)$ may be infinite, we cannot directly apply LinUCB or SupLinUCB's regret
bound. Instead, Valko et al. (2013) defined a data-dependent quantity $\tilde{d}$ called the effective
dimension: let $(\lambda_{i,T})_{i \ge 1}$ denote the eigenvalues of $\Phi_T^\top \Phi_T + \gamma I$ in decreasing order, and define $\tilde{d}$
as

$$\tilde{d} = \min\{j : j\gamma\ln T \ge \Lambda_{T,j}\} \quad \text{where} \quad \Lambda_{T,j} = \sum_{i > j} \lambda_{i,T} - \gamma$$

$\tilde{d}$ measures how quickly the eigenvalues of $\Phi_T^\top \Phi_T$ are decreasing. Valko et al. (2013) showed
that if $\langle f, f \rangle_{\mathcal{H}} \le B$ for some B and if we set the regularization parameter $\gamma = 1/B$ and the scaling
parameter $\eta = \sqrt{2\ln(2TN/\delta)}$, then the regret bound of SupKernelUCB is $O(\sqrt{B\tilde{d}T})$. They
showed that for the linear kernel $\tilde{d} \le d$. Also, compared with GP-UCB, $I(r_A; f) \ge \Omega(\tilde{d}\ln\ln T)$,
which means KernelUCB achieves a better regret bound than GP-UCB in the agnostic case.


4.3 Stochastic Contextual Bandits with Arbitrary Set of Policies 


4.3.1 Epoch-Greedy 


Epoch-Greedy (Langford and Zhang, 2008) treats contextual bandits as a classification
problem, and it solves an empirical risk minimization (ERM) problem to find the currently
best policy. One advantage of Epoch-Greedy is that the hypothesis space can be
finite or even infinite with finite VC-dimension, without an assumption of linear payoff.

There are two key problems Epoch-Greedy needs to solve in order to achieve low regret:
(1) how to get an unbiased estimator from ERM; (2) how to balance exploration and exploitation
when we don't know the time horizon T. To solve the first problem, Epoch-Greedy makes
an explicit distinction between exploration and exploitation steps. In an exploration step, it
selects an arm uniformly at random, and the goal is to form unbiased samples for learning.
In an exploitation step, it selects the arm based on the best policy learned from the exploration
samples. Epoch-Greedy adopts the trick we described in Section 2 to get an
unbiased estimator. For the second problem, note that since Epoch-Greedy strictly separates
exploration and exploitation steps, if it knew T in advance then it should always
explore for the first T' steps, and then exploit for the following T - T' steps. The reason
is that there is no advantage in taking an exploitation step before the last exploration step.
However, T is generally unknown, so the Epoch-Greedy algorithm runs in a mini-batch style: it
runs one epoch at a time, and within that epoch, it first performs one step of exploration,
followed by several steps of exploitation. The algorithm is shown in Algorithm 7.


Algorithm 7 Epoch-Greedy

Require: $s(W_\ell)$: exploitation steps given samples $W_\ell$
Init exploration samples $W_0 = \{\}$, $t_1 = 1$
for $\ell = 1, 2, \ldots$ do
    $t = t_\ell$    ▷ One step of exploration
    Draw an arm $a_t \in \{1, \ldots, K\}$ uniformly at random
    Receive reward $r_{a_t} \in [0,1]$
    $W_\ell = W_{\ell-1} \cup \{(x_t, a_t, r_{a_t})\}$
    Solve $\hat{h}_\ell = \arg\max_{h \in \mathcal{H}} \sum_{(x,a,r_a) \in W_\ell} r_a \mathbb{I}(h(x) = a)$
    $t_{\ell+1} = t_\ell + s(W_\ell) + 1$
    for $t = t_\ell + 1, \ldots, t_{\ell+1} - 1$ do    ▷ $s(W_\ell)$ steps of exploitation
        Select arm $a_t = \hat{h}_\ell(x_t)$
        Receive reward $r_{a_t} \in [0,1]$
    end for
end for

Different from the EXP4 setting, we do not assume an adversarial environment here. Instead, we assume there is a distribution $P$ over $(x, r)$, where $x \in \mathcal{X}$ is the context and $r = [r_1, \ldots, r_K] \in [0,1]^K$ is the reward vector. At time $t$, the world reveals context $x_t$, the algorithm selects arm $a_t \in \{1, \ldots, K\}$ based on the context, and then the world reveals the reward $r_{a_t}$ of arm $a_t$. The algorithm makes its decision based on a policy/hypothesis $h \in \mathcal{H} : \mathcal{X} \to \{1, \ldots, K\}$. $\mathcal{H}$ is the policy/hypothesis space; it can be an infinite space such as all linear hypotheses in dimension $d$, or a finite space consisting of $N = |\mathcal{H}|$ hypotheses. In this survey we mainly focus on finite spaces, but it is easy to extend to infinite
space.
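The epoch structure of Epoch-Greedy can be sketched in a few lines. The code below is an illustrative simulation over a fixed sequence of contexts and reward vectors, not the authors' implementation: the function name `epoch_greedy`, the representation of policies as Python callables, and the toy environment are assumptions made for the example. ERM is done by brute force over a finite policy list.

```python
import random

def epoch_greedy(contexts, rewards, policies, s, rng):
    """Run Epoch-Greedy on a fixed sequence and return the total reward collected.

    s(n) gives the number of exploitation steps after n exploration samples.
    """
    K = len(rewards[0])
    T = len(contexts)
    samples = []  # exploration samples (x, arm, reward of that arm)
    total, t = 0.0, 0
    while t < T:
        # One exploration step: pull an arm uniformly at random.
        arm = rng.randrange(K)
        samples.append((contexts[t], arm, rewards[t][arm]))
        total += rewards[t][arm]
        t += 1
        # ERM on exploration samples only (uniform sampling, so 1/K weights cancel).
        best = max(policies,
                   key=lambda h: sum(r for x, a, r in samples if h(x) == a))
        # s(n) exploitation steps with the current best policy.
        for _ in range(s(len(samples))):
            if t >= T:
                break
            total += rewards[t][best(contexts[t])]
            t += 1
    return total

# Toy problem (assumed for illustration): reward 1 iff the arm equals the context.
contexts = [t % 2 for t in range(60)]
rewards = [[1.0 if a == x else 0.0 for a in range(2)] for x in contexts]
policies = [lambda x: x, lambda x: 1 - x]
total = epoch_greedy(contexts, rewards, policies, lambda n: 5, random.Random(0))
```

Only the exploration samples enter the ERM step, which is what keeps the estimator unbiased; rewards seen during exploitation are deliberately ignored for learning.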

Let $Z_t = (x_t, a_t, r_{a_t})$ be the $t$-th exploration sample, and $Z_1^n = \{Z_1, \ldots, Z_n\}$. The expected reward of a hypothesis $h$ is

$$R(h) = \mathbb{E}_{(x,r) \sim P}[r_{h(x)}]$$

so the regret of the algorithm is

$$R_T = \sup_{h \in \mathcal{H}} T R(h) - \mathbb{E}\sum_{t=1}^{T} r_{a_t}$$

The expectation is with respect to $Z_1^n$ and any random variable in the algorithm.

Denote the data-dependent exploitation step count by $s(Z_1^n)$, so $s(Z_1^n)$ means that based on all samples $Z_1^n$ from exploration steps, the algorithm should do $s(Z_1^n)$ steps of exploitation. The hypothesis that maximizes the empirical reward is

$$\hat{h}(Z_1^n) = \arg\max_{h \in \mathcal{H}} \sum_{t=1}^{n} \frac{r_{a_t} \mathbb{I}(h(x_t) = a_t)}{1/K}$$

The per-epoch exploitation cost is defined as

$$\mu_n(\mathcal{H}, s) = \mathbb{E}_{Z_1^n}\left[\left(\sup_{h \in \mathcal{H}} R(h) - R(\hat{h}(Z_1^n))\right) s(Z_1^n)\right]$$

When $s(Z_1^n) = 1$,

$$\mu_n(\mathcal{H}, 1) = \mathbb{E}_{Z_1^n}\left[\sup_{h \in \mathcal{H}} R(h) - R(\hat{h}(Z_1^n))\right]$$

The per-epoch exploration regret is at most 1 since we only do one step of exploration, so we would want to select an $s(Z_1^n)$ such that the per-epoch exploitation regret satisfies $\mu_n(\mathcal{H}, s) \le 1$. Later we will show how to choose $s(Z_1^n)$.

Theorem 4.10 For all $T$, $n_\ell$, $L$ such that $T \le L + \sum_{\ell=1}^{L} n_\ell$, the regret of Epoch-Greedy is bounded by

$$R_T \le L + \sum_{\ell=1}^{L} \mu_\ell(\mathcal{H}, s) + T \sum_{\ell=1}^{L} P[s(Z_1^\ell) < n_\ell]$$

The above theorem means that if we only consider the first $L$ epochs, and for each epoch $\ell$ we use a sample-independent variable $n_\ell$ to bound $s(Z_1^\ell)$, then the regret up to time $T$ is bounded by the expression above.

Proof Based on the relationship between $s(Z_1^\ell)$ and $n_\ell$, one of the following two events will occur:

1. $s(Z_1^\ell) < n_\ell$ for some $\ell = 1, \ldots, L$

2. $s(Z_1^\ell) \ge n_\ell$ for all $\ell = 1, \ldots, L$

In the second event, $n_\ell$ is a lower bound for $s(Z_1^\ell)$, so $T \le L + \sum_{\ell=1}^{L} n_\ell \le L + \sum_{\ell=1}^{L} s(Z_1^\ell)$, so the epoch that contains $T$ is at most epoch $L$, hence the regret is at most the sum of the regret in the first $L$ epochs. Also, within each epoch the algorithm does one step of exploration followed by exploitation, so the regret bound when event 2 occurs is

$$R_{T,2} \le L + \sum_{\ell=1}^{L} \mu_\ell(\mathcal{H}, s)$$

The regret bound when event 1 occurs is $R_{T,1} \le T$ because the reward $r \in [0,1]$. Together we get the regret bound

$$\begin{aligned}
R_T &\le T \sum_{\ell=1}^{L} P[s(Z_1^\ell) < n_\ell] + \prod_{\ell=1}^{L} P[s(Z_1^\ell) \ge n_\ell] \sum_{\ell=1}^{L} \left(1 + \mu_\ell(\mathcal{H}, s)\right) \\
&\le T \sum_{\ell=1}^{L} P[s(Z_1^\ell) < n_\ell] + L + \sum_{\ell=1}^{L} \mu_\ell(\mathcal{H}, s)
\end{aligned}$$


Theorem 4.10 gives us a general bound; we now derive a specific problem-independent bound based on it.

One essential thing we need to do is to bound $\sup_{h \in \mathcal{H}} R(h) - R(\hat{h}(Z_1^n))$. If the hypothesis space $\mathcal{H}$ is finite, we can use a finite-class uniform bound, and if $\mathcal{H}$ is infinite, we can use VC-dimension or other uniform bound techniques for infinite classes. The two proofs are similar, and to be consistent with the original paper, we assume here that $\mathcal{H}$ is a finite space.

Theorem 4.11 (Bernstein) If $P(|Y_i| \le c) = 1$ and $\mathbb{E}(Y_i) = 0$, then for any $t > 0$,

$$P\left(|\bar{Y}_n| > t\right) \le 2\exp\left(-\frac{n t^2}{2\sigma^2 + 2ct/3}\right)$$

where $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Var}(Y_i)$.

Theorem 4.12 With probability $1 - \delta$, the problem-independent regret of Epoch-Greedy is

$$R_T \le c\, T^{2/3} \left(K \ln(|\mathcal{H}|/\delta)\right)^{1/3}$$


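Proof Following Section 2, define $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{I}(h(x_i) = a_i)\, r_{a_i}}{1/K}$, the empirical sample reward of a hypothesis $h$. Also define $\hat{R}_i = K\, \mathbb{I}(h(x_i) = a_i)\, r_{a_i}$, so that $\mathbb{E}\hat{R}(h) = R(h)$, and

$$\begin{aligned}
\mathrm{var}(\hat{R}_i) &\le \mathbb{E}(\hat{R}_i^2) \\
&= \mathbb{E}\left[K^2\, \mathbb{I}(h(x_i) = a_i)\, r_{a_i}^2\right] \\
&\le \mathbb{E}\left[K^2\, \mathbb{I}(h(x_i) = a_i)\right] \\
&= K^2 \cdot \frac{1}{K} \\
&= K
\end{aligned}$$

So the variance is bounded by $K$ and we can apply Bernstein's inequality to get:

$$P\left(|\hat{R}(h) - R(h)| > \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2K + 2c\epsilon/3}\right)$$

From the union bound we have

$$P\left(\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \epsilon\right) \le 2N\exp\left(-\frac{n\epsilon^2}{2K + 2c\epsilon/3}\right)$$

Setting the right-hand side to $\delta$ and solving for $\epsilon$, we have

$$\epsilon = c\sqrt{\frac{K\ln(N/\delta)}{n}}$$

So, with probability $1 - \delta$,

$$\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| \le c\sqrt{\frac{K\ln(N/\delta)}{n}}$$

Let $\hat{h}$ be the estimated hypothesis, and $h^*$ be the best hypothesis. Since $\hat{h}$ maximizes the empirical reward, with probability $1 - \delta$,

$$\begin{aligned}
R(h^*) &\le \hat{R}(h^*) + c\sqrt{\frac{K\ln(N/\delta)}{n}} \\
&\le \hat{R}(\hat{h}) + c\sqrt{\frac{K\ln(N/\delta)}{n}} \\
&\le R(\hat{h}) + 2c\sqrt{\frac{K\ln(N/\delta)}{n}}
\end{aligned}$$

So

$$\mu_n(\mathcal{H}, 1) \le 2c\sqrt{\frac{K\ln(N/\delta)}{n}}$$

To make $\mu_\ell(\mathcal{H}, s) \le 1$, we can choose

$$s(Z_1^\ell) = \left\lfloor c'\sqrt{\ell / (K\ln(N/\delta))} \right\rfloor$$

Take $n_\ell = \lfloor c'\sqrt{\ell / (K\ln(N/\delta))} \rfloor$, then $P[s(Z_1^\ell) < n_\ell] = 0$. So the regret

$$R_T \le L + \sum_{\ell=1}^{L} \mu_\ell(\mathcal{H}, s) \le 2L$$

Now the only job is to find $L$. We can pick an $L$ such that $T \le \sum_{\ell=1}^{L} n_\ell$, so $T$ will also satisfy $T \le L + \sum_{\ell=1}^{L} n_\ell$:

$$\begin{aligned}
T &= \sum_{\ell=1}^{L} n_\ell \\
&= \sum_{\ell=1}^{L} \left\lfloor c'\sqrt{\ell / (K\ln(N/\delta))} \right\rfloor \\
&= c'\left\lfloor \sqrt{1/(K\ln(N/\delta))} \sum_{\ell=1}^{L} \sqrt{\ell} \right\rfloor \\
&= c''\left\lfloor \sqrt{1/(K\ln(N/\delta))}\, L^{3/2} \right\rfloor
\end{aligned}$$

So

$$L = c''\left\lfloor (K\ln(N/\delta))^{1/3}\, T^{2/3} \right\rfloor$$

$$R_T \le c'''(K\ln(N/\delta))^{1/3}\, T^{2/3}$$

Hence, with probability $1 - \delta$, the regret of Epoch-Greedy is $O((K\ln(N/\delta))^{1/3}\, T^{2/3})$. ■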

Compared to EXP4, Epoch-Greedy has a weaker bound, but its bound holds with high probability rather than only in expectation; compared to EXP4.P, Epoch-Greedy has a weaker bound but it
does not require the knowledge of T.
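The choice of exploitation steps in the proof can be checked numerically. The sketch below assumes the schedule $s_\ell = \lfloor c\sqrt{\ell / (K\ln(N/\delta))} \rfloor$ with the unspecified constant set to $c = 1$, and counts how many epochs are needed to cover a horizon $T$; the helper names are hypothetical. Since the regret is at most $2L$, the number of epochs is the quantity that should grow like $T^{2/3}$.

```python
import math

def exploitation_steps(ell, K, N, delta, c=1.0):
    """Exploitation steps in epoch ell: floor(c * sqrt(ell / (K ln(N / delta))))."""
    return math.floor(c * math.sqrt(ell / (K * math.log(N / delta))))

def epochs_needed(T, K, N, delta, c=1.0):
    """Smallest L such that L exploration steps plus all exploitation steps cover T rounds."""
    covered, L = 0, 0
    while covered < T:
        L += 1
        covered += 1 + exploitation_steps(L, K, N, delta, c)
    return L

L_small = epochs_needed(1000, K=2, N=10, delta=0.1)
L_large = epochs_needed(8000, K=2, N=10, delta=0.1)
```

Multiplying the horizon by 8 should multiply the epoch count by roughly $8^{2/3} = 4$ rather than 8, which is the sub-linear growth the theorem describes.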


4.3.2 RandomizedUCB 

Recall that in EXP4.P and Epoch-Greedy we are always competing with the best policy/expert, and the optimal regret bound $O(\sqrt{KT \ln N})$ scales only logarithmically in the number of policies, so we could boost the model performance by adding more and more potential policies to the policy set $\mathcal{H}$. With high probability EXP4.P achieves the optimal regret; however, its running time scales linearly instead of logarithmically in the number of experts. As a result, we are constrained by the computational bottleneck. Epoch-Greedy could achieve sub-linear running time depending on what assumptions we make about $\mathcal{H}$ and ERM; however, its regret bound is $O((K \ln N)^{1/3} T^{2/3})$, which is sub-optimal. RandomizedUCB (Dudik et al., 2011), on the other hand, could achieve optimal regret while having a $\mathrm{polylog}(N)$ running time. One key difference compared to Epoch-Greedy is that it assigns a non-uniform distribution over policies, while Epoch-Greedy assigns a uniform distribution when doing exploration. Also, RandomizedUCB does not make an explicit distinction between exploration and exploitation.

Similar to Epoch-Greedy, let $A$ be a set of $K$ arms $\{1, \ldots, K\}$, and $D$ be an arbitrary distribution over $(x, r)$, where $x \in \mathcal{X}$ is the context and $r \in [0,1]^K$ is the reward vector. Let $D_X$ be the marginal distribution of $D$ over $x$. At time $t$, the world samples a pair $(x_t, r_t)$ and reveals $x_t$ to the algorithm; the algorithm then picks an arm $a_t \in A$ and receives reward $r_{a_t}$ from the world. Denote a set of policies $h : \mathcal{X} \to A$ by $\mathcal{H}$. The algorithm has access to $\mathcal{H}$ and makes decisions based on $x_t$ and $\mathcal{H}$. The expected reward of a policy $h \in \mathcal{H}$ is

$$R(h) = \mathbb{E}_{(x,r) \sim D}[r_{h(x)}]$$

and the regret is defined as

$$R_T = \sup_{h \in \mathcal{H}} T R(h) - \mathbb{E}\sum_{t=1}^{T} r_{a_t}$$

Denote the sample at time $t$ by $Z_t = (x_t, a_t, r_{a_t}, p_{a_t})$, where $p_{a_t}$ is the probability of choosing $a_t$ at time $t$. Denote all the samples up to time $t$ by $Z_1^t = \{Z_1, \ldots, Z_t\}$. Then the unbiased reward estimator of policy $h$ is

$$\hat{R}(h) = \frac{1}{t}\sum_{(x,a,r,p) \in Z_1^t} \frac{r\, \mathbb{I}(h(x) = a)}{p}$$

The unbiased empirical reward maximization estimator at time $t$ is

$$\hat{h}_t = \arg\max_{h \in \mathcal{H}} \sum_{(x,a,r,p) \in Z_1^t} \frac{r\, \mathbb{I}(h(x) = a)}{p}$$

RandomizedUCB chooses a distribution $P$ over the policies $\mathcal{H}$, which in turn induces distributions over arms. Define

$$W_P(x, a) = \sum_{h : h(x) = a} P(h)$$

to be the induced distribution over arms, and

$$W_{P,\mu}(x, a) = (1 - K\mu) W_P(x, a) + \mu$$

to be the smoothed version of $W_P$ with a minimum probability of $\mu$. Define

$$R(W) = \mathbb{E}_{(x,r) \sim D}[r \cdot W(x)]$$

$$\hat{R}(W) = \frac{1}{t}\sum_{(x,a,r,p) \in Z_1^t} \frac{r\, W(x, a)}{p}$$
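The quantities defined above are straightforward to compute for a finite policy set. The following minimal sketch shows the induced distribution $W_P$, its smoothed version $W_{P,\mu}$, and the inverse-probability-weighted estimate $\hat{R}(h)$; policies are assumed to be Python callables returning an arm index, and all function names are illustrative.

```python
def induced_distribution(P, policies, x, K):
    """W_P(x, .): probability of each arm when a policy is drawn from P."""
    W = [0.0] * K
    for prob, h in zip(P, policies):
        W[h(x)] += prob
    return W

def smooth(W, mu):
    """W_{P,mu}(x, a) = (1 - K * mu) * W_P(x, a) + mu."""
    K = len(W)
    return [(1 - K * mu) * w + mu for w in W]

def ips_value(samples, h):
    """Unbiased estimate of R(h) from logged tuples (x, a, r, p)."""
    return sum(r / p for x, a, r, p in samples if h(x) == a) / len(samples)

policies = [lambda x: 0, lambda x: 1, lambda x: x % 2]
W = induced_distribution([0.5, 0.25, 0.25], policies, x=3, K=2)
W_smooth = smooth([1.0, 0.0], mu=0.1)
value = ips_value([(0, 0, 1.0, 0.5), (1, 1, 0.0, 0.5)], lambda x: 0)
```

Smoothing guarantees every arm has probability at least $\mu$, which keeps the importance weights $1/p$ in the estimator bounded by $1/\mu$.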


To introduce RandomizedUCB, let's introduce the POLICYELIMINATION algorithm first. POLICYELIMINATION is not practical but it captures the basic ideas behind RandomizedUCB. The general idea is to find the best policy by empirical risk. However, empirical risk suffers from variance (there is no bias since we again adopt the trick in Section 2), so POLICYELIMINATION chooses a distribution $P_t$ over all policies to control the variance of $\hat{R}(h)$ for all policies, and then eliminates policies that are not likely to be optimal.

By the minimax theorem, Dudik et al. (2011) proved that there always exists a distribution $P_t$ satisfying the constraint in Algorithm 8.

Theorem 4.13 (Freedman-style Inequality) Let $y_1, \ldots, y_T$ be a sequence of real-valued random variables. Let $V, R \in \mathbb{R}$ be such that $\sum_{t=1}^{T} \mathrm{var}[y_t] \le V$, and for all $t$, $y_t - \mathbb{E}_t[y_t] \le R$. Then for any $\delta > 0$ such that $R \le \sqrt{V / \ln(2/\delta)}$, with probability at least $1 - \delta$,

$$\left|\sum_{t=1}^{T} y_t - \sum_{t=1}^{T} \mathbb{E}_t[y_t]\right| \le 2\sqrt{V\ln(2/\delta)}$$

Algorithm 8 POLICYELIMINATION

Require: $\delta \in (0,1]$
Define $\delta_t = \delta / (4Nt^2)$, $b_t = 2\sqrt{\frac{2K\ln(1/\delta_t)}{t}}$, $\mu_t = \min\left\{\frac{1}{2K}, \sqrt{\frac{\ln(1/\delta_t)}{2Kt}}\right\}$, $\mathcal{H}_0 = \mathcal{H}$
for $t = 1, \ldots, T$ do
    Choose a distribution $P_t$ over $\mathcal{H}_{t-1}$ s.t. $\forall h \in \mathcal{H}_{t-1}$:
        $\mathbb{E}_{x \sim D_X}\left[\frac{1}{W_{P_t,\mu_t}(x, h(x))}\right] \le 2K$    (10)
    Sample $a_t$ from $W'_t = W_{P_t,\mu_t}(x_t, \cdot)$
    Receive reward $r_{a_t}$
    Let $\mathcal{H}_t = \left\{h \in \mathcal{H}_{t-1} : \hat{R}_t(h) \ge \max_{h' \in \mathcal{H}_{t-1}} \hat{R}_t(h') - 2b_t\right\}$    (11)
end for
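The bookkeeping of POLICYELIMINATION is simple even though choosing $P_t$ is not. The sketch below computes the per-round quantities and performs the elimination step of Equation (11). It assumes the parameter definitions $\delta_t = \delta/(4Nt^2)$, $b_t = 2\sqrt{2K\ln(1/\delta_t)/t}$ and $\mu_t = \min\{1/(2K), \sqrt{\ln(1/\delta_t)/(2Kt)}\}$; the function names and the dictionary of estimates are illustrative, and the search for $P_t$ is deliberately left out.

```python
import math

def pe_parameters(t, K, N, delta):
    """Return (delta_t, b_t, mu_t) for round t."""
    delta_t = delta / (4 * N * t * t)
    log_term = math.log(1 / delta_t)
    b_t = 2 * math.sqrt(2 * K * log_term / t)
    mu_t = min(1 / (2 * K), math.sqrt(log_term / (2 * K * t)))
    return delta_t, b_t, mu_t

def eliminate(estimates, b_t):
    """Keep the policies whose estimated reward is within 2 * b_t of the best estimate."""
    best = max(estimates.values())
    return {h for h, value in estimates.items() if value >= best - 2 * b_t}

estimates = {"a": 0.9, "b": 0.5, "c": 0.1}  # hypothetical reward estimates per policy
tight = eliminate(estimates, b_t=0.1)
loose = eliminate(estimates, b_t=0.25)
delta_1, b_1, mu_1 = pe_parameters(1, K=2, N=10, delta=0.1)
```

Early rounds have a wide confidence radius $b_t$, so almost nothing is eliminated; as $t$ grows the radius shrinks and only near-optimal policies survive.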


Theorem 4.14 With probability at least $1 - \delta$, the regret of POLICYELIMINATION is bounded by:

$$R_T \le 16\sqrt{2TK\ln\frac{4T^2N}{\delta}}$$

Proof Let

$$\hat{R}_t(h) = \frac{r_{a_t}\, \mathbb{I}(h(x_t) = a_t)}{W'_t(h(x_t))}$$

be the estimated reward of policy $h$ at time $t$. To make use of Freedman's inequality, we need to bound the variance of $\hat{R}_t(h)$:

$$\begin{aligned}
\mathrm{var}(\hat{R}_t(h)) &\le \mathbb{E}\,\hat{R}_t(h)^2 \\
&= \mathbb{E}\,\frac{r_{a_t}^2\, \mathbb{I}(h(x_t) = a_t)}{W'_t(h(x_t))^2} \\
&\le \mathbb{E}\,\frac{\mathbb{I}(h(x_t) = a_t)}{W'_t(h(x_t))^2} \\
&= \mathbb{E}\,\frac{1}{W'_t(h(x_t))} \\
&\le 2K
\end{aligned}$$

The last inequality follows from the constraint in Equation (10). So

$$\sum_{i=1}^{t} \mathrm{var}[\hat{R}_i(h)] \le 2Kt = V_t$$

Now we need to check whether $\hat{R}_t$ satisfies the condition in Theorem 4.13. Let $t_0$ be the first $t$ such that $\mu_t < 1/(2K)$. When $t \ge t_0$, for all $t' \le t$,

$$\hat{R}_{t'}(h) \le 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}}$$

So now we can apply Freedman's inequality and get

$$P\left(|\hat{R}(h) - R(h)| \ge b_t\right) \le 2\delta_t$$

Taking the union bound over all policies and all $t$,

$$P\left(\sup_{t} \sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| \ge b_t\right) \le 2N\sum_{t=1}^{T} \delta_t \le \delta$$

So with probability $1 - \delta$, we have

$$|\hat{R}(h) - R(h)| \le b_t$$

When $t < t_0$, we have $\mu_t = 1/(2K)$ and $b_t \ge 4K\mu_t = 2$, so the above bound still holds since the reward is bounded by 1.

To sum up, we make use of the convergence of $\sum_t 1/t^2$ to construct $\delta_t$ so that the union bound is less than $\delta$, we use the condition on $R$ in Freedman's inequality to construct $\mu_t$, and we use Freedman's inequality itself to construct $b_t$.

Lemma 4.15 With probability at least $1 - \delta$,

$$|\hat{R}(h) - R(h)| \le b_t$$

From Lemma 4.15 we have

$$\hat{R}(h) - b_t \le R(h) \le R(h^*) \le \hat{R}(h^*) + b_t$$

$$\hat{R}(h) \le \hat{R}(h^*) + 2b_t$$

where $h^* = \arg\max_{h \in \mathcal{H}} R(h)$. So we can see that $h^*$ is always in $\mathcal{H}_t$ after the policy elimination step (Equation (11)) in Algorithm 8. Also, if $R(h) \le R(h^*) - 4b_t$, then

$$\hat{R}(h) - b_t \le R(h) \le R(h^*) - 4b_t \le \hat{R}(h^*) + b_t - 4b_t$$

$$\hat{R}(h) \le \hat{R}(h^*) - 2b_t$$

However, as we can see from the elimination step, all the policies that satisfy $\hat{R}(h) < \hat{R}(h^*) - 2b_t$ are eliminated. So for all the remaining policies $h \in \mathcal{H}_t$, we have $R(h^*) - R(h) \le 4b_t$, so the regret satisfies

$$\begin{aligned}
R_T &\le \sum_{t=1}^{T} \left(R(h^*) - R(h)\right) \\
&\le 4\sum_{t=1}^{T} b_t \\
&\le 8\sqrt{2K\ln\frac{4NT^2}{\delta}} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \\
&\le 8\sqrt{2K\ln\frac{4NT^2}{\delta}} \cdot 2\sqrt{T} \\
&\le 16\sqrt{2TK\ln\frac{4NT^2}{\delta}}
\end{aligned}$$

POLICYELIMINATION describes the basic idea of RandomizedUCB; however, POLICYELIMINATION is not practical because it does not actually show how to find the distribution $P_t$, and it also requires knowledge of $D_X$. To solve these problems, RandomizedUCB always considers the full set of policies and uses an argmax oracle to find the distribution $P_t$ over all policies, and instead of using $D_X$ the algorithm uses history samples. Define

$$\Delta_D(W) = R(h^*) - R(W)$$

$$\Delta_t(W) = \hat{R}(\hat{h}_t) - \hat{R}(W)$$

RandomizedUCB is described in Algorithm 9. Similar to POLICYELIMINATION, $P_t$ in RandomizedUCB is chosen to control the variance. However, instead of controlling each policy separately, it controls the expectation of the variance with respect to the distribution $Q$. The right-hand side of Equation (12) is upper bounded by $c\Delta_{t-1}(W_Q)^2$, which measures the empirical performance of distribution $Q$. So the general idea of this optimization problem is to bound the expected variance of the empirical reward with respect to every possible distribution $Q$: if $Q$ achieves high empirical reward then the bound is tight and hence the variance is tightly controlled, and if $Q$ has low empirical reward, the bound is loose. This makes sure that $P_t$ puts more weight on policies with low regret. Dudik et al. (2011) showed that the regret of RandomizedUCB is $O(\sqrt{TK \ln(TN/\delta)})$.

To solve the optimization problem in the algorithm, RandomizedUCB uses an argmax oracle (AMO) and relies on the ellipsoid method. The main contribution is the following theorem:

Theorem 4.16 At each time $t$, RandomizedUCB makes $O(t^5 K^4 \ln^2(tK/\delta))$ calls to AMO, and requires additional $O(t^2 K^2)$ processing time. The total running time at each time $t$ is $O(t^5 K^4 \ln^2(tK/\delta) \ln N)$, which is sub-linear in $N$.

Algorithm 9 RandomizedUCB

Define $W_0 = \{\}$, $C_t = 2\ln(Nt/\delta)$, $\mu_t = \min\left\{\frac{1}{2K}, \sqrt{\frac{C_t}{2Kt}}\right\}$
for $t = 1, \ldots, T$ do
    Solve the following optimization problem to get distribution $P_t$ over $\mathcal{H}$:
        $\min_P \sum_{h \in \mathcal{H}} P(h)\, \Delta_{t-1}(h)$
        s.t. for all distributions $Q$ over $\mathcal{H}$:
        $\mathbb{E}_{h \sim Q}\left[\frac{1}{t-1}\sum_{i=1}^{t-1} \frac{1}{W_{P,\mu_t}(x_i, h(x_i))}\right] \le \max\left\{4K, \frac{(t-1)\Delta_{t-1}(W_Q)^2}{180\, C_{t-1}}\right\}$    (12)
    Sample $a_t$ from $W'_t = W_{P_t,\mu_t}(x_t, \cdot)$
    Receive reward $r_{a_t}$
    $W_t = W_{t-1} \cup \{(x_t, a_t, r_{a_t}, W'_t(a_t))\}$
end for

4.3.3 ILOVETOCONBANDITS 


Similar to RandomizedUCB, the Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS algorithm (ILOVETOCONBANDITS), proposed by Agarwal et al. (2014), aims to run in time sub-linear in $N$ (the total number of policies) and achieves the optimal regret bound $O(\sqrt{KT \ln N})$. RandomizedUCB makes a number of calls to AMO over all $T$ steps that is a high-degree polynomial in $T$ (see Theorem 4.16), and ILOVETOCONBANDITS tries to further reduce this time complexity.

Theorem 4.17 ILOVETOCONBANDITS achieves the optimal regret bound, requiring $O(\sqrt{KT / \ln(N/\delta)})$ calls to AMO over $T$ rounds, with probability at least $1 - \delta$.

Let $A$ be a finite set of $K$ actions, $x \in \mathcal{X}$ be a possible context, and $r \in [0,1]^K$ be the reward vector of the arms in $A$. We assume $(x, r)$ follows a distribution $D$. Let $\Pi$ be a finite set of policies that map contexts $x$ to actions $a \in A$, let $Q$ be a distribution over all policies $\Pi$, and let $\Delta^\Pi$ be the set of all possible $Q$. ILOVETOCONBANDITS is described in Algorithm 10. The $\mathrm{Sample}(x_t, Q_{m-1}, \pi_{\tau_{m-1}}, \mu_{m-1})$ function is described in Algorithm 11; it samples an action from a sparse distribution over policies.

Algorithm 10 ILOVETOCONBANDITS

Require: Epoch schedule $0 = \tau_0 < \tau_1 < \tau_2 < \ldots$, $\delta \in (0,1)$
Initial weights $Q_0 = \mathbf{0}$, $m = 1$, $\mu_m = \min\left\{\frac{1}{2K}, \sqrt{\ln(16\tau_m^2|\Pi|/\delta) / (K\tau_m)}\right\}$
for $t = 1, 2, \ldots, T$ do
    $(a_t, p_t(a_t)) = \mathrm{Sample}(x_t, Q_{m-1}, \pi_{\tau_{m-1}}, \mu_{m-1})$
    Pull arm $a_t$ and receive reward $r_t \in [0,1]$
    if $t = \tau_m$ then
        Let $Q_m$ be a solution to (OP) with history $H_t$ and minimum probability $\mu_m$
        $m = m + 1$
    end if
end for

Algorithm 11 Sample

Require: $x, Q, \mu$
Initialize $p_a = \mu$ for every action $a$
for $\pi \in \Pi$ with $Q(\pi) > 0$ do
    $p_{\pi(x)} = p_{\pi(x)} + (1 - K\mu)\, Q(\pi)$
end for
Randomly draw action $a$ from $p$
return $(a, p_a)$

As we can see, the main procedure of ILOVETOCONBANDITS is simple. It solves an optimization problem on pre-specified rounds $\tau_1, \tau_2, \ldots$ to get a sparse distribution $Q$ over all policies, then it samples an action based on this distribution. The main problem now is to choose a sparse distribution $Q$ that achieves low regret and requires as few calls to AMO as possible.

Let $\hat{R}_t(\pi)$ be the unbiased reward estimator of policy $\pi$ over the first $t$ rounds (see Section 2), and let $\pi_t = \arg\max_{\pi} \hat{R}_t(\pi)$; then the estimated empirical regret of $\pi$ is $\widehat{\mathrm{Reg}}_t(\pi) = \hat{R}_t(\pi_t) - \hat{R}_t(\pi)$. Given a history $H_t$ and minimum probability $\mu_m$, define $b_\pi = \frac{\widehat{\mathrm{Reg}}_t(\pi)}{\psi \mu_m}$ for $\psi = 100$; then the optimization problem (OP) is to find a distribution $Q \in \Delta^\Pi$ such that

$$\sum_{\pi \in \Pi} Q(\pi)\, b_\pi \le 2K \qquad (13)$$

$$\forall \pi \in \Pi : \ \hat{\mathbb{E}}_{x \sim H_t}\left[\frac{1}{Q^{\mu_m}(\pi(x) \mid x)}\right] \le 2K + b_\pi \qquad (14)$$

where $Q^{\mu_m}$ is the smoothed version of $Q$ with minimum probability $\mu_m$.

Note that $b_\pi$ is a scaled version of the empirical regret of $\pi$, so Equation (13) is actually a bound on the expected empirical regret with respect to $Q$. This equation can be treated as the exploitation part, since we want to choose a distribution that has low empirical regret. Equation (14), similar to RandomizedUCB, is a bound on the variance of the reward estimator of each policy $\pi \in \Pi$. If the policy has low empirical regret, we want it to have smaller variance so that the reward estimator is more accurate; on the other hand, if the policy has high empirical regret, then we allow it to have a larger variance.


Agarwal et al. (2014) showed that this optimization problem can be solved via coordinate descent with at most $O(\sqrt{Kt / \ln(N/\delta)})$ calls to AMO in round $t$; moreover, the support (the set of non-zero entries) of the resulting distribution $Q$ at time $t$ contains at most $O(\sqrt{Kt / \ln(N/\delta)})$ policies, which is the same as the number of calls to AMO. This results in a sparse $Q$ and hence
sub-linear time complexity for Sample procedure.
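A sparse $Q$ matters because the Sample procedure only needs to visit policies in the support of $Q$. The sketch below is one way to write this, assuming $Q$ is stored as a dictionary from policies (Python callables) to weights that sum to one; the name `sample_action` is illustrative and the leftover-mass handling of the original algorithm is omitted.

```python
import random

def sample_action(x, Q, mu, K, rng):
    """Draw an arm from the mu-smoothed projection of a sparse policy distribution Q.

    Only the support of Q is visited, so the cost is proportional to its size.
    """
    p = [mu] * K
    for policy, weight in Q.items():
        p[policy(x)] += (1 - K * mu) * weight
    a = rng.choices(range(K), weights=p)[0]
    return a, p[a]

def always_zero(x):
    return 0

rng = random.Random(0)
arm, prob = sample_action(x=None, Q={always_zero: 1.0}, mu=0.1, K=2, rng=rng)
greedy = sample_action(x=None, Q={always_zero: 1.0}, mu=0.0, K=2, rng=rng)
```

The returned probability is what gets logged next to the reward so that later importance-weighted estimates remain unbiased.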


Agarwal et al. (2014) also showed that the requirement on the epoch schedule $\tau$ is that $\tau_{m+1} - \tau_m = O(\tau_m)$. So we can set $\tau_m = 2^{m-1}$; then the total number of calls to AMO over all $T$ rounds is only $O(\sqrt{KT / \ln(N/\delta)})$, which is a vast improvement over RandomizedUCB.

Theorem 4.18 With probability at least $1 - \delta$, the regret of ILOVETOCONBANDITS is

$$O\left(\sqrt{KT \ln(TN/\delta)} + K \ln(TN/\delta)\right)$$
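The doubling epoch schedule $\tau_m = 2^{m-1}$ mentioned above means the optimization problem is re-solved only logarithmically many times. A minimal sketch (the function name `epoch_schedule` is illustrative):

```python
def epoch_schedule(T):
    """Rounds tau_m = 2**(m-1) <= T at which the optimization problem is re-solved."""
    taus, tau = [], 1
    while tau <= T:
        taus.append(tau)
        tau *= 2
    return taus

schedule = epoch_schedule(10)
num_solves = len(epoch_schedule(10**6))
```

With this schedule $\tau_{m+1} - \tau_m = \tau_m$, so the requirement $\tau_{m+1} - \tau_m = O(\tau_m)$ holds, and a horizon of a million rounds triggers only 20 re-solves.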


5. Adversarial Contextual Bandits 

In adversarial contextual bandits, the reward of each arm does not necessarily follow a fixed probability distribution, and it can be picked by an adversary against the agent. One way to solve the adversarial contextual bandits problem is to model it with expert advice. In this method, there are $N$ experts, and at each time $t$ each expert gives advice about which arm to pull based on the context. The agent has its own strategy to pick an arm based on all the advice it gets. Upon receiving the reward of that arm, the agent may adjust its strategy, for example by changing the weight of, or its belief in, each expert.


5.1 EXP4 


Exponential-weight Algorithm for Exploration and Exploitation using Expert advice (EXP4) (Auer et al., 2002b) assumes each expert generates an advice vector based on the current context $x_t$ at time $t$. Advice vectors are distributions over arms, and are denoted by $\xi_t^1, \xi_t^2, \ldots, \xi_t^N \in [0,1]^K$. $\xi_{t,j}^i$ indicates expert $i$'s recommended probability of playing arm $j$ at time $t$. The algorithm pulls an arm based on these advice vectors. Let $r_t \in [0,1]^K$ be the true reward vector at time $t$; then the expected reward of expert $i$ is $\xi_t^i \cdot r_t$. The algorithm competes with the best expert, which achieves the highest expected cumulative reward

$$G_{\max} = \max_i \sum_{t=1}^{T} \xi_t^i \cdot r_t$$

The regret is defined as:

$$R_T = \max_i \sum_{t=1}^{T} \xi_t^i \cdot r_t - \mathbb{E}\sum_{t=1}^{T} r_{t,a_t}$$

The expectation is with respect to the algorithm's random choice of the arm and any other random variable in the algorithm. Note that we don't make any assumption on the distribution of the reward, so EXP4 is an adversarial bandits algorithm.

The EXP4 algorithm is described in Algorithm 12. Note that the context $x_t$ does not appear in the algorithm, since it is only used by the experts to generate advice.

If an expert assigns uniform weight to all actions at each time $t$, then we call the expert
a uniform expert.
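The steps of EXP4 (Algorithm 12) can be sketched compactly. The code below is an illustrative implementation for a pre-generated sequence of advice vectors and reward vectors; the data layout (`advice[t][i][j]`, `rewards[t][j]`), the function name `exp4`, and the toy experts are assumptions made for the example.

```python
import math
import random

def exp4(advice, rewards, gamma, rng):
    """advice[t][i][j]: expert i's probability for arm j at time t; rewards[t][j] in [0, 1]."""
    N, K = len(advice[0]), len(rewards[0])
    w = [1.0] * N
    total = 0.0
    for xi, r in zip(advice, rewards):
        W = sum(w)
        # Mix the weighted expert advice with uniform exploration gamma / K.
        p = [(1 - gamma) * sum(w[i] * xi[i][j] for i in range(N)) / W + gamma / K
             for j in range(K)]
        a = rng.choices(range(K), weights=p)[0]
        total += r[a]
        r_hat = r[a] / p[a]  # importance-weighted reward; zero for all other arms
        for i in range(N):
            y_hat = xi[i][a] * r_hat  # xi_i . r_hat, since only arm a is non-zero
            w[i] *= math.exp(gamma * y_hat / K)
    return total, w

# Toy setting: arm 0 always pays 1; experts are "always 0", "always 1", and uniform.
T = 200
advice = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]] for _ in range(T)]
rewards = [[1.0, 0.0] for _ in range(T)]
total, w = exp4(advice, rewards, gamma=0.1, rng=random.Random(0))
```

In this toy run the weight of the expert that always recommends the paying arm grows fastest, the uniform expert grows more slowly, and the expert recommending the non-paying arm never gains weight.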

Algorithm 12 EXP4

Require: $\gamma \in (0,1]$
Set $w_{1,i} = 1$ for $i = 1, \ldots, N$
for $t = 1, 2, \ldots, T$ do
    Get expert advice vectors $\{\xi_t^1, \ldots, \xi_t^N\}$; each vector is a distribution over arms.
    for $j = 1, \ldots, K$ do
        $p_{t,j} = (1 - \gamma)\sum_{i=1}^{N} \frac{w_{t,i}\, \xi_{t,j}^i}{\sum_{i=1}^{N} w_{t,i}} + \frac{\gamma}{K}$
    end for
    Draw action $a_t$ according to $p_t$, and receive reward $r_{a_t}$
    for $j = 1, \ldots, K$ do    ▷ Calculate unbiased estimator of $r_t$
        $\hat{r}_{t,j} = \frac{r_{t,j}}{p_{t,j}} \mathbb{I}(j = a_t)$
    end for
    for $i = 1, \ldots, N$ do    ▷ Calculate estimated expected reward and update weight
        $\hat{y}_{t,i} = \xi_t^i \cdot \hat{r}_t$
        $w_{t+1,i} = w_{t,i} \exp(\gamma \hat{y}_{t,i} / K)$
    end for
end for

Theorem 5.1 For any family of experts which includes a uniform expert, EXP4's regret is bounded by $O(\sqrt{TK \ln N})$.

Proof The general idea of the proof is to bound the expected cumulative reward $\mathbb{E}\sum_{t=1}^{T} r_{t,a_t}$; then, since $G_{\max}$ is bounded by the time horizon $T$, we can get a bound on $G_{\max} - \mathbb{E}\sum_{t=1}^{T} r_{t,a_t}$.




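Let $W_t = \sum_{i=1}^{N} w_{t,i}$ and $q_{t,i} = w_{t,i} / W_t$, then

$$\begin{aligned}
\frac{W_{t+1}}{W_t} &= \sum_{i=1}^{N} \frac{w_{t+1,i}}{W_t} \\
&= \sum_{i=1}^{N} q_{t,i} \exp(\gamma \hat{y}_{t,i} / K) \\
&\le \sum_{i=1}^{N} q_{t,i}\left[1 + \frac{\gamma}{K}\hat{y}_{t,i} + (e-2)\left(\frac{\gamma}{K}\right)^2 \hat{y}_{t,i}^2\right] \qquad (15) \\
&= 1 + \frac{\gamma}{K}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} + (e-2)\left(\frac{\gamma}{K}\right)^2 \sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 \\
&\le \exp\left(\frac{\gamma}{K}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} + (e-2)\left(\frac{\gamma}{K}\right)^2 \sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2\right) \qquad (16)
\end{aligned}$$

Equation (15) is due to $e^x \le 1 + x + (e-2)x^2$ for $x \le 1$, and Equation (16) is due to $1 + x \le e^x$. Taking logarithms and summing over $t$,

$$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} + (e-2)\left(\frac{\gamma}{K}\right)^2 \sum_{t=1}^{T}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 \qquad (17)$$

For any expert $k$,

$$\begin{aligned}
\ln\frac{W_{T+1}}{W_1} &\ge \ln\frac{w_{T+1,k}}{W_1} \\
&= \ln w_{1,k} + \sum_{t=1}^{T} \frac{\gamma}{K}\hat{y}_{t,k} - \ln W_1 \\
&= \frac{\gamma}{K}\sum_{t=1}^{T} \hat{y}_{t,k} - \ln N
\end{aligned}$$

Together with Equation (17) we get

$$\sum_{t=1}^{T}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} \ge \sum_{t=1}^{T} \hat{y}_{t,k} - \frac{K\ln N}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 \qquad (18)$$

Now we need to bound $\sum_i q_{t,i}\hat{y}_{t,i}$ and $\sum_i q_{t,i}\hat{y}_{t,i}^2$. From the definition of $p_{t,j}$ we have

$$\sum_{i=1}^{N} q_{t,i}\, \xi_{t,j}^i = \frac{p_{t,j} - \gamma/K}{1 - \gamma} \le \frac{p_{t,j}}{1 - \gamma}$$

so

$$\begin{aligned}
\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} &= \sum_{i=1}^{N} q_{t,i}\left(\sum_{j=1}^{K} \xi_{t,j}^i\, \hat{r}_{t,j}\right) \\
&= \sum_{j=1}^{K}\left(\sum_{i=1}^{N} q_{t,i}\, \xi_{t,j}^i\right)\hat{r}_{t,j} \\
&\le \sum_{j=1}^{K} \frac{p_{t,j}\, \hat{r}_{t,j}}{1 - \gamma} \\
&= \frac{r_{t,a_t}}{1 - \gamma}
\end{aligned}$$

where $a_t$ is the arm pulled at time $t$. Similarly,

$$\begin{aligned}
\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 &= \sum_{i=1}^{N} q_{t,i}\left(\xi_{t,a_t}^i\, \hat{r}_{t,a_t}\right)^2 \\
&\le \sum_{i=1}^{N} q_{t,i}\, \xi_{t,a_t}^i\, \hat{r}_{t,a_t}^2 \\
&\le \hat{r}_{t,a_t}^2\, \frac{p_{t,a_t}}{1 - \gamma} \\
&\le \frac{\hat{r}_{t,a_t}}{1 - \gamma}
\end{aligned}$$

Together with Equation (18) we have

$$\begin{aligned}
\sum_{t=1}^{T} r_{t,a_t} &\ge (1 - \gamma)\sum_{t=1}^{T} \hat{y}_{t,k} - \frac{K\ln N}{\gamma}(1 - \gamma) - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{j=1}^{K} \hat{r}_{t,j} \\
&\ge (1 - \gamma)\sum_{t=1}^{T} \hat{y}_{t,k} - \frac{K\ln N}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{j=1}^{K} \hat{r}_{t,j}
\end{aligned}$$

Taking the expectation of both sides of the inequality, and using $\mathbb{E}[\hat{y}_{t,k}] = \xi_t^k \cdot r_t$ and $\mathbb{E}[\hat{r}_{t,j}] = r_{t,j}$, we get

$$\mathbb{E}\sum_{t=1}^{T} r_{t,a_t} \ge (1 - \gamma)\sum_{t=1}^{T} \xi_t^k \cdot r_t - \frac{K\ln N}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{j=1}^{K} r_{t,j} \qquad (19)$$

Since there is a uniform expert in the expert set, $G_{\max} \ge \frac{1}{K}\sum_{t=1}^{T}\sum_{j=1}^{K} r_{t,j}$. Let $G_{exp4} = \sum_{t=1}^{T} r_{t,a_t}$; then Equation (19) can be rewritten as

$$\mathbb{E}G_{exp4} \ge (1 - \gamma)\sum_{t=1}^{T} \xi_t^k \cdot r_t - \frac{K\ln N}{\gamma} - (e-2)\gamma G_{\max}$$

for any $k$. Let $k$ be the expert with the highest expected cumulative reward; then we have

$$\mathbb{E}G_{exp4} \ge (1 - \gamma)G_{\max} - \frac{K\ln N}{\gamma} - (e-2)\gamma G_{\max}$$

$$G_{\max} - \mathbb{E}G_{exp4} \le \frac{K\ln N}{\gamma} + (e-1)\gamma G_{\max}$$

We need to select a $\gamma$ such that the right-hand side of the above inequality is minimized, so that the regret bound is minimized. An additional constraint is that $\gamma \le 1$. Taking the derivative with respect to $\gamma$ and setting it to 0, we get

$$\gamma^* = \min\left\{1, \sqrt{\frac{K\ln N}{(e-1)G_{\max}}}\right\}$$

$$G_{\max} - \mathbb{E}G_{exp4} \le 2.63\sqrt{G_{\max} K \ln N}$$

Since $G_{\max} \le T$, we have $R_T = O(\sqrt{TK \ln N})$. One important thing to notice is that getting such a regret bound requires knowledge of $T$, the time horizon. Later we will introduce algorithms that do not require such knowledge. ■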
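The tuning of $\gamma$ at the end of the proof can be checked numerically. The sketch below evaluates the right-hand side $K\ln N/\gamma + (e-1)\gamma G_{\max}$ at $\gamma^*$ and compares it with the stated constant; the function names and the example values of $K$, $N$ and $G_{\max}$ are arbitrary choices for illustration.

```python
import math

def exp4_gamma(K, N, G):
    """gamma* = min(1, sqrt(K ln N / ((e - 1) G)))."""
    return min(1.0, math.sqrt(K * math.log(N) / ((math.e - 1) * G)))

def exp4_bound(K, N, G, gamma):
    """Right-hand side of the regret inequality: K ln N / gamma + (e - 1) gamma G."""
    return K * math.log(N) / gamma + (math.e - 1) * gamma * G

K, N, G = 10, 100, 10000
gamma_star = exp4_gamma(K, N, G)
best = exp4_bound(K, N, G, gamma_star)
```

At $\gamma^*$ the two terms are equal and the bound becomes $2\sqrt{(e-1) G_{\max} K \ln N}$; since $2\sqrt{e-1} \approx 2.62$, this is consistent with the constant 2.63 above.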


5.2 EXP4.P 


The unbiased estimator of the reward vector used by EXP4 has high variance due to the increased range of the random variable $r_{a_t} / p_{a_t}$ (Dudik et al., 2014), and the regret bound of EXP4, $O(\sqrt{TK \ln N})$, holds only in expectation. EXP4.P (Beygelzimer et al., 2011) improves this result and achieves the same regret with high probability. To do this, EXP4.P combines the ideas of both UCB (Auer et al., 2002a) and EXP4. It computes a confidence interval for the reward vector estimator and hence bounds the cumulative reward of each expert with high probability; then it designs a strategy to weight each expert.

Similar to the EXP4 setting, there are $K$ arms $\{1, 2, \ldots, K\}$ and $N$ experts $\{1, 2, \ldots, N\}$. At time $t \in \{1, \ldots, T\}$, the world reveals context $x_t$, and each expert $i$ outputs an advice vector $\xi_t^i$ representing its recommendations on each arm. The agent then selects an arm $a_t$ based on the advice, and an adversary chooses a reward vector $r_t$. Finally the world reveals the reward of the chosen arm, $r_{t,a_t}$. Let $G_i$ be the expected cumulative reward of expert $i$:

$$G_i = \sum_{t=1}^{T} \xi_t^i \cdot r_t$$

let $p_j$ be the algorithm's probability of pulling arm $j$, and let $\hat{r}$ be the estimated reward vector, where

$$\hat{r}_j = \begin{cases} r_j / p_j & \text{if } j = a_t \\ 0 & \text{if } j \ne a_t \end{cases}$$

let $\hat{G}_i$ be the estimated expected cumulative reward of expert $i$:

$$\hat{G}_i = \sum_{t=1}^{T} \xi_t^i \cdot \hat{r}_t$$

let $G_{exp4.p}$ be the cumulative reward of the algorithm:

$$G_{exp4.p} = \sum_{t=1}^{T} r_{t,a_t}$$

then the expected regret of the algorithm is

$$R_T = \max_i G_i - \mathbb{E}G_{exp4.p}$$

However, we are interested in regret bounds that hold with arbitrarily high probability. The regret is bounded by $\epsilon$ with probability $1 - \delta$ if

$$P\left(\left(\max_i G_i - G_{exp4.p}\right) > \epsilon\right) \le \delta$$

We need to bound $G_i - \hat{G}_i$ with high probability so that we can bound the regret with high probability. To do that, we need to use the following theorem:
probability. To do that, we need to use the following theorem: 


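Theorem 5.2 Let $X_1, \ldots, X_T$ be a sequence of real-valued random variables. Suppose that $X_t \le R$ and $\mathbb{E}(X_t) = 0$. Define the random variables

$$S = \sum_{t=1}^{T} X_t, \qquad V = \sum_{t=1}^{T} \mathbb{E}(X_t^2)$$

then for any $\delta$ and any fixed $V'$, with probability $1 - \delta$, we have

$$S \le \begin{cases}
\sqrt{(e-2)\ln(1/\delta)}\left(\frac{V}{\sqrt{V'}} + \sqrt{V'}\right) & \text{if } V' \in \left[\frac{R^2\ln(1/\delta)}{e-2}, \infty\right) \\
R\ln(1/\delta) + (e-2)\frac{V}{R} & \text{if } V' \in \left[0, \frac{R^2\ln(1/\delta)}{e-2}\right]
\end{cases}$$

To bound $\hat{G}_i$, let $X_t = \xi_t^i \cdot r_t - \xi_t^i \cdot \hat{r}_t$, so $\mathbb{E}(X_t) = 0$, $R = 1$ and

$$\begin{aligned}
\mathbb{E}(X_t^2) &\le \mathbb{E}(\xi_t^i \cdot \hat{r}_t)^2 \\
&= \sum_{j=1}^{K} p_{t,j}\left(\xi_{t,j}^i \cdot \frac{r_{t,j}}{p_{t,j}}\right)^2 \\
&\le \sum_{j=1}^{K} \frac{\xi_{t,j}^i}{p_{t,j}} \\
&\overset{\text{def}}{=} \hat{v}_{t,i}
\end{aligned}$$

The above derivation used the fact that $r_{t,j} \le 1$. Let $V' = KT$, assume $\ln(N/\delta) \le KT$, and use $\delta/N$ instead of $\delta$; we can then apply Theorem 5.2 to get

$$P\left(G_i - \hat{G}_i \ge \sqrt{(e-2)\ln(N/\delta)}\left(\frac{\sum_{t=1}^{T} \hat{v}_{t,i}}{\sqrt{KT}} + \sqrt{KT}\right)\right) \le \frac{\delta}{N}$$

Applying the union bound we get:

Theorem 5.3 Assume $\ln(N/\delta) \le KT$, and define $\hat{\sigma}_i = \sqrt{KT} + \frac{1}{\sqrt{KT}}\sum_{t=1}^{T} \hat{v}_{t,i}$. Then with probability $1 - \delta$,

$$\sup_i \left(G_i - \hat{G}_i\right) \le \sqrt{\ln\frac{N}{\delta}}\; \hat{\sigma}_i$$

The confidence interval we get from Theorem 5.3 is used to construct the EXP4.P algorithm. The details of the algorithm are described in Algorithm 13. We can see that EXP4.P is very similar to the EXP4 algorithm, except that when updating $w_{t,i}$, instead of using the estimated
reward, we use the upper confidence bound of the estimated reward.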
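The difference from EXP4 is easiest to see in the weight update. The sketch below performs a single EXP4.P update, assuming $p_{\min} = \sqrt{\ln N/(KT)}$ and the update $w_{t+1,i} = w_{t,i}\exp\left(\frac{p_{\min}}{2}\left(\hat{y}_{t,i} + \hat{v}_{t,i}\sqrt{\ln(N/\delta)/(KT)}\right)\right)$; the function name `exp4p_update` and its argument layout are illustrative.

```python
import math

def exp4p_update(w, xi, p, a, r_a, delta, T):
    """One EXP4.P weight update after pulling arm a and observing reward r_a.

    w: current expert weights; xi[i][j]: advice of expert i for arm j;
    p[j]: probability with which arm j was pulled.
    """
    N, K = len(w), len(p)
    p_min = math.sqrt(math.log(N) / (K * T))
    bonus = math.sqrt(math.log(N / delta) / (K * T))
    new_w = []
    for i in range(N):
        y_hat = xi[i][a] * r_a / p[a]                   # estimated reward of expert i
        v_hat = sum(xi[i][j] / p[j] for j in range(K))  # variance proxy of expert i
        new_w.append(w[i] * math.exp(p_min / 2 * (y_hat + v_hat * bonus)))
    return new_w

new_w = exp4p_update(w=[1.0, 1.0], xi=[[1.0, 0.0], [0.0, 1.0]],
                     p=[0.5, 0.5], a=0, r_a=1.0, delta=0.1, T=100)
```

Even the expert whose recommended arm was not pulled gains some weight through the $\hat{v}_{t,i}$ term; this is the upper-confidence-bound ingredient that EXP4 lacks.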


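Algorithm 13 EXP4.P

Require: $\delta > 0$
Define $p_{\min} = \sqrt{\frac{\ln N}{KT}}$, set $w_{1,i} = 1$ for $i = 1, \ldots, N$.
for $t = 1, 2, \ldots, T$ do
    Get expert advice vectors $\{\xi_t^1, \ldots, \xi_t^N\}$
    for $j = 1, 2, \ldots, K$ do
        $p_{t,j} = (1 - Kp_{\min})\sum_{i=1}^{N} \frac{w_{t,i}\, \xi_{t,j}^i}{\sum_{i=1}^{N} w_{t,i}} + p_{\min}$
    end for
    Draw action $a_t$ according to $p_t$ and receive reward $r_{a_t}$.
    for $j = 1, \ldots, K$ do
        $\hat{r}_{t,j} = \frac{r_{t,j}}{p_{t,j}} \mathbb{I}(j = a_t)$
    end for
    for $i = 1, \ldots, N$ do
        $\hat{y}_{t,i} = \xi_t^i \cdot \hat{r}_t$
        $\hat{v}_{t,i} = \sum_{j=1}^{K} \xi_{t,j}^i / p_{t,j}$
        $w_{t+1,i} = w_{t,i} \exp\left(\frac{p_{\min}}{2}\left(\hat{y}_{t,i} + \hat{v}_{t,i}\sqrt{\frac{\ln(N/\delta)}{KT}}\right)\right)$
    end for
end for

Theorem 5.4 Assume that $\ln(N/\delta) \le KT$, and the set of experts includes a uniform expert which selects an arm uniformly at random at each time. Then with probability $1 - \delta$,

$$R_T = \max_i G_i - G_{exp4.p} \le 6\sqrt{KT \ln(N/\delta)}$$

Proof The proof is similar to the proof of the regret bound of EXP4. Basically, we want to bound $G_{exp4.p} = \sum_{t=1}^{T} r_{t,a_t}$; since we can bound $\max_i G_i$ with high probability, we then get the regret of EXP4.P with high probability.

Let $q_{t,i} = \frac{w_{t,i}}{\sum_i w_{t,i}}$, $\gamma = \sqrt{\frac{K\ln N}{T}}$, and $\hat{U} = \max_i\left(\hat{G}_i + \hat{\sigma}_i\sqrt{\ln(N/\delta)}\right)$. We need the following inequalities:

$$\hat{v}_{t,i} \le 1/p_{\min}$$

$$\sum_{i=1}^{N} q_{t,i}\hat{v}_{t,i} \le \frac{K}{1 - \gamma}$$

To see why the second one is true:

$$\begin{aligned}
\sum_{i=1}^{N} q_{t,i}\hat{v}_{t,i} &= \sum_{i=1}^{N} q_{t,i}\sum_{j=1}^{K} \frac{\xi_{t,j}^i}{p_{t,j}} \\
&= \sum_{j=1}^{K} \frac{1}{p_{t,j}}\sum_{i=1}^{N} q_{t,i}\, \xi_{t,j}^i \\
&\le \sum_{j=1}^{K} \frac{1}{1 - \gamma} \\
&= \frac{K}{1 - \gamma}
\end{aligned}$$

We also need the following two inequalities, which have been proved in Section 5.1:

$$\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} \le \frac{r_{t,a_t}}{1 - \gamma}$$

$$\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 \le \frac{\hat{r}_{t,a_t}}{1 - \gamma}$$

Let $b = \frac{p_{\min}}{2}$ and $c = \frac{p_{\min}\sqrt{\ln(N/\delta)}}{2\sqrt{KT}}$, then

$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{N} \frac{w_{t+1,i}}{W_t} = \sum_{i=1}^{N} q_{t,i}\exp(b\hat{y}_{t,i} + c\hat{v}_{t,i})$$

Since $e^a \le 1 + a + (e-2)a^2$ for $a \le 1$ and $e - 2 \le 1$, we have

$$\begin{aligned}
\frac{W_{t+1}}{W_t} &\le \sum_{i=1}^{N} q_{t,i}\left[1 + b\hat{y}_{t,i} + c\hat{v}_{t,i}\right] + \sum_{i=1}^{N} q_{t,i}\left[2b^2\hat{y}_{t,i}^2 + 2c^2\hat{v}_{t,i}^2\right] \\
&= 1 + b\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i} + c\sum_{i=1}^{N} q_{t,i}\hat{v}_{t,i} + 2b^2\sum_{i=1}^{N} q_{t,i}\hat{y}_{t,i}^2 + 2c^2\sum_{i=1}^{N} q_{t,i}\hat{v}_{t,i}^2 \\
&\le 1 + b\frac{r_{t,a_t}}{1 - \gamma} + c\frac{K}{1 - \gamma} + 2b^2\frac{\hat{r}_{t,a_t}}{1 - \gamma} + 2c^2\sqrt{\frac{KT}{\ln N}}\frac{K}{1 - \gamma}
\end{aligned}$$

Taking logarithms on both sides, summing over $t$, and making use of the fact that $\ln(1 + x) \le x$, we have

$$\ln\frac{W_{T+1}}{W_1} \le \frac{b}{1 - \gamma}\sum_{t=1}^{T} r_{t,a_t} + c\frac{KT}{1 - \gamma} + \frac{2b^2}{1 - \gamma}\sum_{t=1}^{T} \hat{r}_{t,a_t} + 2c^2\sqrt{\frac{KT}{\ln N}}\frac{KT}{1 - \gamma}$$

Let $\hat{G}_{uniform}$ be the estimated cumulative reward of the uniform expert, then

$$\hat{G}_{uniform} = \frac{1}{K}\sum_{t=1}^{T}\sum_{j=1}^{K} \hat{r}_{t,j} = \frac{1}{K}\sum_{t=1}^{T} \hat{r}_{t,a_t}$$

So

$$\begin{aligned}
\ln\frac{W_{T+1}}{W_1} &\le \frac{b}{1 - \gamma}\sum_{t=1}^{T} r_{t,a_t} + c\frac{KT}{1 - \gamma} + \frac{2b^2}{1 - \gamma}K\hat{G}_{uniform} + 2c^2\sqrt{\frac{KT}{\ln N}}\frac{KT}{1 - \gamma} \\
&\le \frac{b}{1 - \gamma}G_{exp4.p} + c\frac{KT}{1 - \gamma} + \frac{2b^2}{1 - \gamma}K\hat{U} + 2c^2\sqrt{\frac{KT}{\ln N}}\frac{KT}{1 - \gamma}
\end{aligned}$$

Also

$$\begin{aligned}
\ln(W_{T+1}) &\ge \max_i \ln(w_{T+1,i}) \\
&= \max_i\left(b\hat{G}_i + c\sum_{t=1}^{T} \hat{v}_{t,i}\right) \\
&= b\hat{U} - b\sqrt{KT\ln(N/\delta)}
\end{aligned}$$

So

$$b\hat{U} - b\sqrt{KT\ln(N/\delta)} - \ln N \le \frac{b}{1 - \gamma}G_{exp4.p} + c\frac{KT}{1 - \gamma} + \frac{2b^2}{1 - \gamma}K\hat{U} + 2c^2\sqrt{\frac{KT}{\ln N}}\frac{KT}{1 - \gamma}$$

$$G_{exp4.p} \ge \left(1 - 2\sqrt{\frac{K\ln N}{T}}\right)\hat{U} - \ln(N/\delta) - 2\sqrt{KT\ln N} - \sqrt{KT\ln(N/\delta)}$$

We already know from Theorem 5.3 that $\max_i G_i \le \hat{U}$ with probability $1 - \delta$, and also $\max_i G_i \le T$, so with probability $1 - \delta$,

$$\begin{aligned}
G_{exp4.p} &\ge \max_i G_i - 2\sqrt{\frac{K\ln N}{T}}\, T - \ln(N/\delta) - 2\sqrt{KT\ln N} - \sqrt{KT\ln(N/\delta)} \\
&\ge \max_i G_i - 6\sqrt{KT\ln(N/\delta)}
\end{aligned}$$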

5.3 Infinitely Many Experts


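Sometimes we have an infinite number of experts in the expert set $\Pi$. For example, an expert could be a $d$-dimensional vector $\beta \in \mathbb{R}^d$, and the predicted reward could be $\beta^\top x$ for some context $x$. Neither EXP4 nor EXP4.P is able to handle infinitely many experts.

A possible solution is to construct a finite approximation $\hat{\Pi}$ to $\Pi$, and then use EXP4 or EXP4.P on $\hat{\Pi}$ (Bartlett, 2014; Beygelzimer et al., 2011). Suppose for every expert $\pi \in \Pi$ there is a $\hat{\pi} \in \hat{\Pi}$ with

$$P(\pi(x_t) \ne \hat{\pi}(x_t)) \le \epsilon$$

where $x_t$ is the context and $\pi(x_t)$ is the chosen arm. Then the reward $r \in [0,1]$ satisfies

$$\mathbb{E}\left|r_{\pi(x_t)} - r_{\hat{\pi}(x_t)}\right| \le \epsilon$$

We compete with the best expert in $\Pi$; the regret is

$$R_T(\Pi) = \sup_{\pi \in \Pi} \mathbb{E}\sum_{t=1}^{T} r_{\pi(x_t)} - \mathbb{E}\sum_{t=1}^{T} r_{a_t}$$

And we can bound $R_T(\Pi)$ with $R_T(\hat{\Pi})$:

$$\begin{aligned}
R_T(\Pi) &= \sup_{\pi \in \Pi} \mathbb{E}\sum_{t=1}^{T} r_{\pi(x_t)} - \sup_{\hat{\pi} \in \hat{\Pi}} \mathbb{E}\sum_{t=1}^{T} r_{\hat{\pi}(x_t)} + \sup_{\hat{\pi} \in \hat{\Pi}} \mathbb{E}\sum_{t=1}^{T} r_{\hat{\pi}(x_t)} - \mathbb{E}\sum_{t=1}^{T} r_{a_t} \\
&= \sup_{\pi \in \Pi}\inf_{\hat{\pi} \in \hat{\Pi}} \mathbb{E}\sum_{t=1}^{T}\left(r_{\pi(x_t)} - r_{\hat{\pi}(x_t)}\right) + R_T(\hat{\Pi}) \\
&\le T\epsilon + R_T(\hat{\Pi})
\end{aligned}$$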

There are many ways to construct such a $\hat{\Pi}$. Here we describe an algorithm called VE (Beygelzimer et al., 2011). The idea is to choose an arm uniformly at random for the first $\tau$ rounds, after which we have $\tau$ contexts $x_1, \ldots, x_\tau$. Given an expert $\pi \in \Pi$, we can get a sequence of predictions $\{\pi(x_1), \ldots, \pi(x_\tau)\}$. Such sequences are enumerable, so we can construct $\hat{\Pi}$ containing one representative $\hat{\pi}$ for each sequence $\{\pi(x_1), \ldots, \pi(x_\tau)\}$. Then we apply EXP4/EXP4.P on $\hat{\Pi}$. VE is shown in Algorithm 14.


Algorithm 14 VE

Require: $\tau$
for $t = 1, 2, \ldots, \tau$ do
    Receive context $x_t$
    Choose arm uniformly at random
end for
Construct $\hat{\Pi}$ based on $x_1, \ldots, x_\tau$
for $t = \tau + 1, \ldots, T$ do
    Apply EXP4/EXP4.P
end for
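The construction of $\hat{\Pi}$ in VE amounts to grouping policies by their predictions on the first $\tau$ contexts and keeping one policy per group. The sketch below does this for an explicitly enumerated candidate list, which stands in for the (possibly infinite) class; the function name `representatives` and the threshold policies are illustrative assumptions.

```python
def representatives(policies, contexts):
    """Keep one policy for each distinct prediction sequence on the given contexts."""
    chosen = {}
    for policy in policies:
        signature = tuple(policy(x) for x in contexts)
        chosen.setdefault(signature, policy)  # first policy seen represents the group
    return list(chosen.values())

# Threshold policies on the real line: pull arm 1 iff x > c.
thresholds = [i / 10 for i in range(10)]
policies = [lambda x, c=c: int(x > c) for c in thresholds]
reduced = representatives(policies, contexts=[0.25, 0.75])
```

Ten threshold policies collapse to three representatives on two contexts, in line with the Sauer-type counting used in the proof below: the number of distinct prediction sequences grows polynomially in $\tau$ when the VC dimension is finite.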


Theorem 5.5 For all policy sets H with VC dimension d, t = T {2d In ^ + In , with 
probability 1 — 6 

/ Z' T 2 

Rt < 9W2r ( din ^ + In - 

Proof Given vr G H and corresponding vr G H 

T 

= ^ I{Tr{xt) / TT{xt)) (20) 

t=r~\-l 

We need to measure the expected disagreements of vr and vr after time r. Suppose the total 
disagreements within time T is n, then if we randomly pick r contexts, the probability that 
TT and vr produce the same sequence is 

P(Vt e [l,T],ir(i,) = = (l “ “ T^) ■" “ tVTVt) 

_ nr 

< e r 


40 











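From Sauer's lemma we have that $|\hat{\Pi}| \le \left(\frac{e\tau}{d}\right)^{d}$ for all $\tau > d$, and the number of unique
sequences produced by all $\pi \in \Pi$ is at most $\left(\frac{e\tau}{d}\right)^{d}$ for all $\tau > d$. For a $\pi \in \Pi$ and
corresponding $\hat{\pi} \in \hat{\Pi}$, we have

$$\begin{aligned}
P\left( \sum_{t=\tau+1}^{T} I\big(\pi(x_t) \neq \hat{\pi}(x_t)\big) > n \right)
&\le P\left( \exists\, \pi', \pi'' \in \Pi :\ \sum_{t=1}^{T} I\big(\pi'(x_t) \neq \pi''(x_t)\big) > n \ \text{and}\ \forall t \in [1, \tau],\ \pi'(x_t) = \pi''(x_t) \right) \\
&\le |\hat{\Pi}|^{2}\, e^{-n\tau/T} \\
&\le \left(\frac{e\tau}{d}\right)^{2d} e^{-n\tau/T}
\end{aligned}$$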

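Setting the right-hand side to $\frac{\delta}{2}$, we get:

$$n \ge \frac{T}{\tau}\left(2d \ln\frac{e\tau}{d} + \ln\frac{2}{\delta}\right)$$

Together with Equation (20), we get with probability $1 - \frac{\delta}{2}$

$$G_{\max}(\Pi) \le G_{\max}(\hat{\Pi}) + \frac{T}{\tau}\left(2d \ln\frac{e\tau}{d} + \ln\frac{2}{\delta}\right)$$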

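Now we need to bound $G_{\text{EXP4.P}}$. From Sauer's lemma we have that $|\hat{\Pi}| \le \left(\frac{e\tau}{d}\right)^{d}$ for all
$\tau > d$, so we can directly apply EXP4.P's bound. With probability $1 - \frac{\delta}{2}$

$$G_{\text{EXP4.P}}(\hat{\Pi}, T - \tau) \ge G_{\max}(\hat{\Pi}) - 6\sqrt{2(T - \tau)\left(d \ln\frac{e\tau}{d} + \ln\frac{2}{\delta}\right)}$$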


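Finally, we get the bound on $G_{\text{VE}}$:

$$G_{\text{VE}} \ge G_{\max}(\Pi) - \tau - \frac{T}{\tau}\left(2d \ln\frac{e\tau}{d} + \ln\frac{2}{\delta}\right) - 6\sqrt{2T\left(d \ln\frac{e\tau}{d} + \ln\frac{2}{\delta}\right)}$$

Setting $\tau = \sqrt{T\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}$ we get

$$G_{\text{VE}} \ge G_{\max}(\Pi) - 9\sqrt{2T\left(d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}$$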


6. Conclusion 

The nature of contextual bandits makes them suitable for many machine learning applications
such as user modeling, Internet advertising, search engines, and experiment optimization,
and there has been growing interest in this area. One topic we haven't covered is
offline evaluation in contextual bandits. This is tricky since the policy being evaluated is different
from the policy that generated the data, so the arm proposed offline does not necessarily
match the one pulled online. Li et al. (2011) proposed an unbiased offline evaluation method
assuming that the logging policy selects arms uniformly at random. Strehl et al. (2010)
proposed a method that estimates the probability of the logging policy selecting each
arm, and then adopts inverse propensity scoring (IPS) to evaluate a new policy. Langford et al.
(2011) proposed a method that combines the direct method and IPS to improve accuracy
and reduce variance.
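
As a rough illustration of the IPS idea mentioned above, the sketch below evaluates a target policy on logged data. It is not taken from any of the cited papers: the record layout `(context, arm, reward, propensity)`, the function name `ips_value` and the toy two-armed data are all assumptions made for the example, and the logging propensities are assumed to be known exactly.

```python
from typing import Callable, List, Tuple

# One logged round: (context, arm pulled, observed reward, logging probability).
Record = Tuple[float, int, float, float]


def ips_value(logged: List[Record], policy: Callable[[float], int]) -> float:
    """Inverse propensity score estimate of the average reward of `policy`."""
    total = 0.0
    for context, arm, reward, propensity in logged:
        # Only rounds where the target policy agrees with the logged arm count,
        # reweighted by how unlikely the logging policy was to pull that arm.
        if policy(context) == arm:
            total += reward / propensity
    return total / len(logged)


def true_reward(context: float, arm: int) -> float:
    """Toy environment (assumed): arm 1 pays off iff the context exceeds 0.5."""
    return 1.0 if arm == int(context > 0.5) else 0.0


# Logging policy pulls each of the two arms with probability 0.5; here every
# context appears once with each arm so the example is deterministic.
contexts = [i / 10 for i in range(10)]
logged = [(x, a, true_reward(x, a), 0.5) for x in contexts for a in (0, 1)]

best = lambda x: int(x > 0.5)   # always pulls the paying arm
always_zero = lambda x: 0       # ignores the context
print(ips_value(logged, best), ips_value(logged, always_zero))
```

With uniformly random logging the propensity is the same constant for every arm, which is the setting assumed by the first evaluation method described above.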

Finally, note that the regret bound is not the only criterion for bandit algorithms. First,
the bounds we discussed in this survey are problem-independent bounds, and there are also
problem-dependent bounds. For example, Langford and Zhang (2008) proved that although
Epoch-Greedy's problem-independent bound is not optimal, it can achieve an $O(\ln T)$
problem-dependent bound. Second, different bandit algorithms make different
assumptions (stochastic/adversarial, linearity, number of policies, Bayesian, etc.), so when
choosing which one to use, we need to choose the one that matches our assumptions.